
PRiSM: Benchmarking Phone Realization in Speech Models

Beginner
Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim et al. · 1/20/2026
arXiv · PDF

Key Summary

  • PRiSM is a new open-source benchmark that checks how well speech models hear and write down tiny speech sounds called phones.
  • It grades models in two ways: by how clean their phonetic transcriptions are (intrinsic) and by how useful those transcripts or hidden features are in real tasks (extrinsic).
  • A smarter error score called PFER looks at articulatory features (like voicing or lip rounding) instead of just counting phone mistakes, making comparisons fairer across languages.
  • Across many tests, encoder-CTC models were the most stable, and models trained on more data and a more diverse set of languages did better, especially on unseen languages.
  • Specialized phone recognition models still beat big general audio models (LALMs) at hearing fine phonetic details.
  • Hidden representations from large ASR models like Whisper were excellent for downstream probes, even when their raw transcripts weren’t the best.
  • Transcripts worked best for tasks like dialect geolocation, while hidden features worked best for pathological speech—so both views are needed.
  • PRiSM includes clinical, educational, and multilingual tasks to reveal different strengths and blind spots in models.
  • The benchmark shows that judging only transcription accuracy can miss important phonetic abilities that matter in the real world.
  • PRiSM releases code, recipes, and datasets to help everyone build speech models with stronger, more universal phonetic skills.

Why This Research Matters

Better phone hearing in machines supports earlier, fairer clinical screenings for speech disorders. Language learners can receive precise, phone-level feedback that helps them improve pronunciation faster. Low-resource and endangered languages gain stronger technology support because models that truly hear phones transfer better to unseen languages. Dialect-aware systems become more accurate and less biased when they recognize real phone patterns instead of defaulting to a few popular population centers. Large audio models get a clearer path to improve on fine-grained phonetics, making them safer and more inclusive. Researchers and practitioners finally have a shared, open benchmark to compare models fairly and pick the right tools for their tasks.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) You know how a music teacher can hear tiny differences between two notes that sound the same to most people? Speech models try to do something similar with human speech—they try to hear tiny building blocks of sounds called phones.

🥬 Filling (The Actual Concept) What it is: Phone Recognition (PR) is when a computer listens to speech and writes down a sequence of phones—the small sound units that describe how a word is actually pronounced. How it works:

  1. Take the audio wave.
  2. Slice it into short moments in time.
  3. For each moment, predict which phone features (like voiced, rounded, nasal) fit best.
  4. Stitch those predictions into a phone transcript, often using the International Phonetic Alphabet (IPA). Why it matters: If we only write words, we lose many details (like accents or disordered speech). Phones keep the fine-grained sound clues that help across languages and special speech settings. 🍞 Bottom Bread (Anchor) Example: The word “tell” can be pronounced slightly differently in American vs. Scottish English. Phone transcripts capture that difference so the model doesn’t pretend they’re exactly the same.
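To make the four steps above concrete, here is a minimal transcription sketch built on an off-the-shelf encoder-CTC phoneme model from the Hugging Face hub; the checkpoint name, the file path, and the 16 kHz mono input are illustrative assumptions, not details from the paper.

```python
# Minimal phone-transcription sketch (assumptions: a CTC phoneme checkpoint
# such as "facebook/wav2vec2-lv-60-espeak-cv-ft" and 16 kHz mono audio).
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

CHECKPOINT = "facebook/wav2vec2-lv-60-espeak-cv-ft"  # illustrative phoneme model

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)

# 1. Take the audio wave (this model family expects 16 kHz mono).
audio, sr = sf.read("tell.wav")

# 2-3. The encoder slices the wave into short frames and scores a phone per frame.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, time_frames, phone_vocab)

# 4. Stitch frame-level predictions into a phone transcript (greedy CTC decode).
ids = torch.argmax(logits, dim=-1)
phones = processor.batch_decode(ids)[0]      # IPA-like symbols for the clip
print(phones)
```

Swapping in a different checkpoint changes the phone inventory the model can output, which is exactly the kind of variation PRiSM tries to score fairly.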

The World Before

Before PRiSM, people mostly judged PR systems by counting how many phones they got wrong. That seems fair—until you notice that phones aren’t just labels; they’re bundles of physical features of how we shape sounds. Two phones can be “close” (like two shades of blue) or “far” (like blue vs. red). Simple phone error counts ignore that closeness. Also, studies used different language sets and different phone inventories, so scores weren’t easy to compare. And we didn’t check how well these systems actually helped real tasks like speech therapy assessment or accent analysis.

🍞 Top Bread (Hook) Imagine two spelling tests: one gives a zero if you wrote “colour” instead of “color,” even though they’re very close, and the other gives partial credit for being close. Which tells you more about your skill?

🥬 Filling (The Actual Concept) What it is: Intrinsic evaluation checks the core skill—how accurate the phone transcripts are. Extrinsic evaluation checks how useful those outputs (or hidden features) are in real tasks. How it works:

  • Intrinsic: Compare predicted phones to reference labels with a smarter score that looks at phonetic features.
  • Extrinsic: Feed either the transcripts or the model’s hidden representations into small task models and see how well they do on things like dysarthria assessment, L2 scoring, language ID, or dialect geolocation. Why it matters: A model might look sloppy in plain phone counts but still carry rich, helpful phonetic cues that help downstream jobs. If we never look, we’ll miss that value. 🍞 Bottom Bread (Anchor) Example: A model’s transcript for a child with a speech disorder might not look perfect, but its hidden features can still predict the child’s intelligibility very well.

The Problem

  • Phone error rates were noisy and hard to compare across studies (different languages, different phone sets, different metrics).
  • Counting raw phone mistakes misses how “close” sounds are, and it misses phonetic knowledge stored inside the model’s hidden layers.
  • People assumed that better transcription equals better downstream performance, but that link wasn’t tested broadly.

Failed Attempts

  • Fixing one metric and adding more datasets helped a bit, but it doesn’t scale because good phonetic labels are scarce.
  • Focusing only on transcriptions ignores powerful internal representations that many models learn.

🍞 Top Bread (Hook) You know how a chef can taste a soup and know what spices are inside, even if you can’t see them? Models also have hidden layers that store useful flavor—phonetic information—you can’t see in the final transcript.

🥬 Filling (The Actual Concept) What it is: Latent representations are the model’s hidden features across time that encode fine speech details. How it works:

  1. Process audio through a neural encoder.
  2. Produce time-aligned vectors that summarize what’s happening in the sound.
  3. Pool or summarize them to make decisions for different tasks. Why it matters: Transcripts simplify and lose detail; hidden features can keep subtle cues (like timbre or slight articulatory shifts) that boost downstream tasks. 🍞 Bottom Bread (Anchor) Example: Whisper’s hidden features score very high when probed for tasks, even when its phone transcript isn’t the very best.
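As a hedged illustration of what these hidden features look like in practice, the sketch below pulls the final-layer states of a pretrained speech encoder and mean-pools them into one vector per clip; the checkpoint name and the plain mean pooling are illustrative choices (the benchmark’s probes, described later, use learned attention pooling instead).

```python
# Sketch: turn audio into a single "hidden feature" vector for a downstream probe.
# Assumptions: an SSL encoder like "facebook/wav2vec2-base" and 16 kHz mono audio.
import torch
import soundfile as sf
from transformers import AutoFeatureExtractor, AutoModel

CHECKPOINT = "facebook/wav2vec2-base"        # illustrative; any speech encoder works

extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

audio, sr = sf.read("clip.wav")
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Steps 1-2: time-aligned hidden vectors, roughly one row per 20 ms of audio.
    frames = encoder(**inputs).last_hidden_state         # (1, time, hidden_dim)

# Step 3: pool over time to get one summary vector a small task model can use.
clip_vector = frames.mean(dim=1)                          # (1, hidden_dim)
print(clip_vector.shape)
```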

The Gap

We needed one open, reproducible benchmark that (1) fairly compares transcriptions using a feature-aware score and (2) also measures real-world usefulness using both transcripts and hidden features.

Real Stakes

  • Clinical: Better phone-level listening helps detect and rate speech disorders more fairly and earlier.
  • Education: More precise feedback helps language learners improve pronunciation.
  • Multilingual tech: Robust phone skills help new or low-resource languages join the digital world.
  • Fairness: Models that only know a few big languages can miss or mislabel others. We need evaluations that reward broad coverage.

🍞 Top Bread (Hook) Imagine judging basketball players only by how many free throws they make in practice, but never watching them in a real game. You’d miss who plays best under pressure.

🥬 Filling (The Actual Concept) What it is: PRiSM is a benchmark that grades phone recognition models both on core transcription skill and on real tasks using transcripts and hidden features. How it works:

  1. Standardize transcription scoring using feature-aware comparisons.
  2. Test usefulness in clinical, educational, and multilingual tasks.
  3. Include many model types—from specialized PR systems to large audio-language models (LALMs). Why it matters: Now we can spot blind spots (like weak performance on unseen languages) and make fair, apples-to-apples comparisons. 🍞 Bottom Bread (Anchor) Example: PRiSM shows that encoder-CTC models are steady performers across domains, while big general audio models still miss fine phonetic details.

02 Core Idea

🍞 Top Bread (Hook) You know how a school report card has different subjects and also a note from the teacher about how you do in class activities? PRiSM is that two-part report card for phone recognition systems.

🥬 Filling (The Actual Concept) What it is: The key insight is to judge phone recognition models with two lenses at once: (1) a feature-aware transcription score (intrinsic) and (2) real-world task performance using both transcripts and hidden features (extrinsic). How it works:

  1. Use a phonetic-feature-based error score so “near-miss” sounds aren’t punished like total mistakes.
  2. Add downstream probes that take either the transcripts or hidden representations and test clinical, educational, and multilingual tasks.
  3. Compare many model families fairly and reproducibly. Why it matters: If we only look at transcription, we miss hidden phonetic skills. If we only look at downstream tasks, we can’t tell whether weakness comes from the transcript, the features, or the task probe. PRiSM reveals both. 🍞 Bottom Bread (Anchor) Example: A model might not top the transcript score but could be best at predicting dysarthria severity using its hidden features. PRiSM catches that.

Multiple Analogies (same idea, three ways)

  • Eye exam vs. reading in class: Intrinsic is like checking if you can read letters on the chart (pure vision). Extrinsic is watching you read a story aloud to classmates (practical reading).
  • Driver’s test: Intrinsic is practicing parking cones in an empty lot; extrinsic is driving in real traffic.
  • Kitchen: Intrinsic is tasting a single spice; extrinsic is how the whole soup tastes when all spices work together.

🍞 Top Bread (Hook) Imagine scoring art only by how many brushstrokes are “correct” and ignoring how the painting moves people. That would miss the point.

🥬 Filling (The Actual Concept) What it is: Phonetic Feature Error Rate (PFER) is a smart score that compares features of sounds (like voicing or lip rounding) rather than treating phones as unrelated labels. How it works:

  1. For each predicted phone and each true phone, compare their articulatory features.
  2. Count how many feature edits are needed to turn one into the other.
  3. Average across the whole transcript. Why it matters: This respects how sounds relate across languages, so models get partial credit when they’re close. 🍞 Bottom Bread (Anchor) Example: Predicting a voiced [b] instead of voiceless [p] is a smaller mistake than replacing [p] with a nasal [m]—PFER recognizes that.

Before vs. After

  • Before: We relied on raw phone error counts and many one-off setups, making cross-study comparison tricky.
  • After: We use a standardized, feature-aware score and a shared suite of downstream tasks using both transcripts and hidden features. Now differences across models and languages are clearer and fairer.

Why It Works (intuition, not equations)

  • Sounds are bundles of features. Scoring by features keeps the geometry of sound space.
  • Hidden layers keep acoustic details that transcripts throw away; probing them recovers value.
  • Testing seen-language variation and truly unseen languages separates memorization from general phonetic skill.

Building Blocks

🍞 Top Bread (Hook) Imagine examining a shell by looking at its pattern (transcript) and also feeling its ridges and weight (hidden features). Both tell part of the story.

🥬 Filling (The Actual Concept) What it is: Transcript Probe (TP) and Representation Probe (RP). How it works:

  • TP: Turn predicted IPA text into features with a small GRU text model.
  • RP: Take the model’s final hidden layer, pool over time with attention, then classify or score. Why it matters: Some tasks love clear symbols (like dialectal phone patterns), others love acoustic detail (like speech disorders). Using both catches both. 🍞 Bottom Bread (Anchor) Example: In Hindi dialect geolocation, TP shines because phone sequence patterns carry location clues. In dysarthria severity, RP shines because fine acoustic cues matter.

🍞 Top Bread (Hook) Think of two kinds of translators: one that aligns sound chunks directly to outputs, and one that predicts the next symbol like a storyteller.

🥬 Filling (The Actual Concept) What it is: Encoder-CTC vs. Attention-based Encoder-Decoder (AED). How it works:

  • Encoder-CTC: An encoder maps audio to time-aligned states, and CTC picks the best path through them.
  • AED: An encoder summarizes audio; a decoder generates outputs step-by-step with attention. Why it matters: Encoder-CTC tends to be stable and faithful to acoustics; AED can learn patterns that help in unseen settings but may normalize tricky inputs. 🍞 Bottom Bread (Anchor) Example: On noisy or long speech, encoder-CTC often stays steady; AED may drift toward common patterns.
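A tiny sketch of the CTC “best path” idea mentioned above: greedy decoding takes the most likely label per frame, collapses repeats, and drops the blank symbol. The toy vocabulary and the blank id are made up for illustration.

```python
# Toy CTC greedy decoding: pick the best label per frame, collapse repeats,
# then drop blanks. The frame labels and blank id below are illustrative.
from itertools import groupby

BLANK = 0  # CTC blank token id (assumption for this toy example)

def ctc_greedy_decode(frame_label_ids):
    """Collapse repeated frame labels, then remove blank symbols."""
    collapsed = [label for label, _ in groupby(frame_label_ids)]
    return [label for label in collapsed if label != BLANK]

# Frame-by-frame argmax over a tiny phone vocabulary {0: blank, 1: /t/, 2: /ɛ/, 3: /l/}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3, 0]
print(ctc_greedy_decode(frames))   # -> [1, 2, 3], i.e. /t ɛ l/
```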

🍞 Top Bread (Hook) Imagine a super-robot that can talk and listen about everything, but doesn’t always notice tiny sound details when you whisper.

🥬 Filling (The Actual Concept) What it is: Large Audio Language Models (LALMs) are big models that handle many audio tasks by reasoning over audio-plus-text, often via prompting. How it works:

  1. Encode audio into internal signals.
  2. Rely on powerful language modeling to produce answers.
  3. Often not strictly time-aligned to the audio’s fine details. Why it matters: Great at many tasks, but today they lag on precise phone hearing, especially for low-resource or dialectal speech. 🍞 Bottom Bread (Anchor) Example: In PRiSM, LALMs underperform specialized PR models on phone transcription and dialect geolocation.

🍞 Top Bread (Hook) You know how learning words from many languages makes it easier to recognize new words later? The same goes for phone sounds.

🥬 Filling (The Actual Concept) What it is: Multilingual training is exposing models to many languages during pretraining and finetuning. How it works:

  1. Pretrain on large, diverse audio.
  2. Finetune on multilingual phone labels, sometimes including pseudo-labels.
  3. Encourage alignment of shared phonetic features. Why it matters: More language diversity builds stronger general phone skills and helps most on unseen languages. 🍞 Bottom Bread (Anchor) Example: ZIPA-CTC-NS trained on pseudo-labeled multilingual data does better on languages it never saw before.

03 Methodology

High-level Recipe: Input → Intrinsic Check → Extrinsic Checks (TP and RP) → Scores and Insights

  • Input: Speech plus gold phone labels (for intrinsic) or task labels (for extrinsic).
  • Intrinsic (core skill): Compare predicted phone transcripts to reference using PFER across two groups: variation in seen language (regional and non-native English) and unseen languages (45–95+ languages).
  • Extrinsic (real-world use): Use two probes—TP (reads transcripts) and RP (reads hidden features)—on three areas: pathological speech, L2 learning, and multilingual identification.
  • Output: Per-task scores + aggregate views to understand model strengths and blind spots.

Step-by-step Details

  1. Intrinsic Evaluation with PFER 🍞 Top Bread (Hook) Imagine you’re matching paint colors—being a shade off is different from picking a totally different color.

🥬 Filling (The Actual Concept) What it is: PFER compares feature bundles of predicted vs. true phones, so “nearby” sounds aren’t penalized like total mismatches. How it works:

  • Represent each phone by articulatory features (like voiced, place, manner, rounding).
  • Compute how many feature tweaks are needed to turn predicted into reference, per phone.
  • Average across the transcript to get a percentage of features wrong. Why it matters: This works across languages and phone sets, making apples-to-apples comparisons more fair. 🍞 Bottom Bread (Anchor) Example: Predicting [d] instead of [t] is one feature off (voicing), smaller than predicting [m] instead of [t] (several features differ).
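Here is a minimal, self-contained sketch of a PFER-style score, using a tiny hand-made feature table and a feature-weighted edit distance. The real benchmark uses a full articulatory feature set and its own normalization, so treat this only as an illustration of how near-miss phones earn partial credit.

```python
# Toy PFER-style score: phones are bundles of articulatory features, and a
# near-miss (one feature off) costs less than a completely different phone.
# The feature table and normalization here are illustrative, not the paper's.

FEATURES = {               # phone -> (voiced, nasal, place, manner)
    "p": (0, 0, "labial",   "stop"),
    "b": (1, 0, "labial",   "stop"),
    "m": (1, 1, "labial",   "nasal"),
    "t": (0, 0, "alveolar", "stop"),
    "d": (1, 0, "alveolar", "stop"),
}

def phone_distance(a, b):
    """Fraction of articulatory features that differ between two phones."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def pfer(pred, ref):
    """Feature-weighted edit distance between phone sequences, averaged over the reference."""
    n, m = len(pred), len(ref)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)                       # deletions
    for j in range(1, m + 1):
        dist[0][j] = float(j)                       # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + phone_distance(pred[i - 1], ref[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1.0, dist[i][j - 1] + 1.0)
    return dist[n][m] / max(m, 1)

# [d] for [t] is one feature off (voicing); [m] for [t] differs in several features.
print(pfer(["d"], ["t"]))   # small error
print(pfer(["m"], ["t"]))   # larger error
```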

We cover:

  • Variation in seen language: TIMIT (American English regions), Speech Accent Archive (many L1 accents speaking English), and L2-ARCTIC Perceived (human-checked L2 English phones).
  • Unseen languages: DoReCo (low-resource languages with careful phone labels), VoxAngeles (95 languages from a cleaned phonetics archive), Tusom2021 (endangered language data).
  2. Extrinsic Evaluation with Two Probes 🍞 Top Bread (Hook) Think of reading a story (transcript) vs. listening to the music under the words (hidden features). Both can teach you something different.

🥬 Filling (The Actual Concept) What it is: Transcript Probe (TP) and Representation Probe (RP). How it works:

  • TP: Convert predicted IPA text into characters, run a small bi-directional GRU over it, and pool to make a prediction.
  • RP: Take the model’s final hidden layer, use attention to pool over time (so important moments count more), then pass through a small MLP. Why it matters: Some tasks benefit from symbol sequences (like dialectal patterns); others need fine acoustic cues (like pathology severity). Using both reveals complementary strengths. 🍞 Bottom Bread (Anchor) Example: For Hindi dialect geolocation, TP beat RP because phone order patterns carry strong location clues.
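A hedged PyTorch sketch of the two probe heads described above: a character-level bidirectional GRU for the Transcript Probe and an attention-pooled MLP for the Representation Probe. The layer sizes, the mean pooling in TP, and the single-head attention are illustrative choices, not the benchmark’s exact configuration.

```python
# Sketch of the two probe heads (sizes and pooling details are illustrative).
import torch
import torch.nn as nn

class TranscriptProbe(nn.Module):
    """TP: embed IPA characters, run a bi-directional GRU, pool, then classify."""
    def __init__(self, vocab_size, num_classes, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, char_ids):                     # (batch, seq_len) of char ids
        states, _ = self.gru(self.embed(char_ids))   # (batch, seq_len, 2*hidden)
        return self.head(states.mean(dim=1))

class RepresentationProbe(nn.Module):
    """RP: attention-pool the encoder's hidden frames, then a small MLP."""
    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)            # scores each time frame
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, frames):                        # (batch, time, feat_dim)
        weights = torch.softmax(self.attn(frames), dim=1)
        pooled = (weights * frames).sum(dim=1)        # weighted sum over time
        return self.mlp(pooled)

# Toy usage: 2 clips -> 3-way classification with each probe.
tp = TranscriptProbe(vocab_size=100, num_classes=3)
rp = RepresentationProbe(feat_dim=768, num_classes=3)
print(tp(torch.randint(0, 100, (2, 30))).shape)       # torch.Size([2, 3])
print(rp(torch.randn(2, 50, 768)).shape)              # torch.Size([2, 3])
```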

We evaluate three real-world areas:

  • Pathological speech: Dysarthria intelligibility (Italian EasyCall; English UASpeech) and child speech disorder detection (UltraSuite).
  • L2 speech: Native-language (L1) classification of learners (EdAcc; L2-ARCTIC+CMU-ARCTIC) and sentence-level L2 pronunciation scores (Speechocean762).
  • Multilingual: Language identification (FLEURS-24), dialect geolocation in Hindi (Vaani-Hi), and phone inventory induction (DoReCo).
  3. Models We Test 🍞 Top Bread (Hook) Imagine three kinds of listeners: a careful note-taker (encoder-CTC), a storyteller who predicts next words (AED), and a generalist who knows a bit of everything (LALM).

🥬 Filling (The Actual Concept) What it is: Encoder-CTC models, AED models, and Large Audio Language Models. How it works:

  • Encoder-CTC (e.g., W2V2Ph family, ZIPA-CTC, ZIPA-CTC-NS, POWSM-CTC): Align audio frames to output symbols with a stable path-finding rule.
  • AED (POWSM): Encode audio then generate phones step-by-step with attention.
  • LALMs (Gemini 2.5 Flash, Qwen3-Omni-Instruct): Use prompting to produce outputs; not strictly time-aligned and hard to access hidden features. Why it matters: Architectures trade off acoustic faithfulness vs. pattern modeling; PRiSM compares them under the same umbrella. 🍞 Bottom Bread (Anchor) Example: ZIPA-CTC-NS, an encoder-CTC with multilingual pseudo-labels, is steady and strong on unseen languages.
  4. Extra Analyses (to understand model behavior) 🍞 Top Bread (Hook) Imagine you mute parts of a song and see if a friend can still guess the tune. If they can, they rely on memory patterns; if not, they really need the notes.

🥬 Filling (The Actual Concept) What it is: Phonotactics vs. acoustics masking test. How it works:

  • Take TIMIT with precise phone timing.
  • Replace various percentages of phones with silence.
  • See how PFER changes. A flat line means relying on acoustics; rising errors suggest leaning on phonotactic patterns. Why it matters: It shows whether a model hears what’s truly there vs. what’s likely. 🍞 Bottom Bread (Anchor) Example: Wav2Vec2-based and CTC models stayed more faithful at high masking levels than some AED or CR-CTC variants, indicating stronger reliance on real acoustics.
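A minimal sketch of the masking setup, assuming you have per-phone start and end times (as TIMIT provides): a chosen fraction of phone segments is zeroed out before re-transcribing and re-scoring with PFER. The sample rate, segment times, and function names are assumptions for illustration.

```python
# Sketch of the phonotactics-vs-acoustics test: silence a fraction of the timed
# phone segments (using gold timings, as in TIMIT) and watch how PFER changes.
import random
import numpy as np

SAMPLE_RATE = 16000   # assumed sample rate for this toy example

def mask_phones(audio, phone_segments, fraction, seed=0):
    """Replace `fraction` of the timed phone segments with silence."""
    rng = random.Random(seed)
    masked = audio.copy()
    chosen = rng.sample(phone_segments, k=int(round(fraction * len(phone_segments))))
    for start_sec, end_sec in chosen:
        start, end = int(start_sec * SAMPLE_RATE), int(end_sec * SAMPLE_RATE)
        masked[start:end] = 0.0                      # silence this phone
    return masked

# Toy example: 1 second of audio with three "phones"; mask roughly a third of them.
audio = np.random.randn(SAMPLE_RATE).astype(np.float32)
segments = [(0.00, 0.30), (0.30, 0.65), (0.65, 1.00)]
masked = mask_phones(audio, segments, fraction=0.33)
# A model relying on real acoustics should now miss the silenced phones;
# a model leaning on phonotactics may still "hallucinate" likely phones there.
```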

🍞 Top Bread (Hook) Think of building a new alphabet for a new language by listening carefully—what set of sounds does it really use?

🥬 Filling (The Actual Concept) What it is: Phone Inventory Induction. How it works:

  • Derive the set of phones a model predicts for a language.
  • Compare it to the language’s gold phone set from DoReCo.
  • Score overlap with precision/recall and F1. Why it matters: It tests zero-shot phonetic breadth—can the model discover the right building blocks without prior labels for that language? 🍞 Bottom Bread (Anchor) Example: POWSM-CTC achieved high precision on unseen languages, showing strong control over which phones it outputs.
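The overlap scoring described above boils down to set precision, recall, and F1 over phone inventories; here is a minimal sketch under that reading, with a made-up toy inventory.

```python
# Sketch of phone inventory induction scoring: compare the set of phones a
# model actually emitted for a language against the gold inventory.
def inventory_scores(predicted_phones, gold_inventory):
    pred, gold = set(predicted_phones), set(gold_inventory)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: the model over-generates one phone and misses one.
print(inventory_scores({"p", "t", "k", "s"}, {"p", "t", "k", "m"}))  # (0.75, 0.75, 0.75)
```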
  5. Secret Sauce
  • Dual view (transcripts + hidden features) exposes different strengths.
  • Feature-aware scoring (PFER) makes multilingual comparisons fairer.
  • Broad task suite (clinical, education, multilingual) prevents overfitting to one niche.
  • Model family diversity (encoder-CTC, AED, LALM, ASR baselines) offers a complete picture.
  • Reproducible, open recipes and log-aware aggregation make results trustworthy and extendable.

04 Experiments & Results

The Test: What and Why

  • Intrinsic: Measure how close predicted phones are to gold phones using PFER on two fronts—variation in seen language (accents, L2 English) and unseen languages (45–95+).
  • Extrinsic: On each downstream task, compare two inputs: TP (transcripts) and RP (hidden features). Tasks include dysarthria severity, child speech disorder detection, L1 classification and L2 scoring, language ID, dialect geolocation, and phone inventory induction.

The Competition

  • Specialized PR models: W2V2P-LV60/XLSR53 (encoder-CTC finetuned from SSL), ZIPA-CTC and ZIPA-CTC-NS (encoder-CTC from scratch, CR-CTC, plus multilingual pseudo-labels), POWSM (AED), POWSM-CTC (encoder-CTC variant).
  • LALMs: Gemini 2.5 Flash, Qwen3-Omni-Instruct (prompted zero-shot; hidden states not easily probed).
  • ASR representation baselines: Whisper, WavLM (for RP only).

Scoreboard with Context

  • Intrinsic (PFER):
      • Seen-language variation: Encoder-CTC models and AED beat LALMs by a clear margin. ZIPA-CTC-NS and similar CTC models gave low PFER (around the low teens), while LALMs were much worse, sometimes producing degenerate outputs.
      • Unseen languages: Multilingual training and exposure helped a lot. ZIPA-CTC-NS stayed strong; AED and CTC were comparable, but LALMs lagged heavily.
      • Takeaway: Diverse language exposure + stable encoder-CTC architectures = consistently good phonetic transcription.

  • Extrinsic (Downstream tasks):
      • Pathological speech: Representation Probes (RP) often won. Whisper’s hidden features were especially strong, like getting an A+ when others got Bs. ZIPA models also did well in Transcript Probes due to compact, reliable vocabularies.
      • L2 speech: Mixed. W2V2P-XLSR53 did well with TP on L2 tasks, helped by multilingual pretraining. RP from large ASR models also scored high.
      • Multilingual tasks: TP often outperformed RP for dialect geolocation in Hindi—like using a clean map of phone sequences to find where someone is from. For language ID and phone inventory induction, multilingual CTC models shined, with POWSM-CTC showing very high precision on unseen languages.

  • LALMs: Generally underperformed specialized PR systems on phone recognition and dialect tasks; showed bias toward high-resource accents and geographical centers (mode collapse), e.g., predicting New Delhi too often.

Surprising Findings

  • Transcripts vs. hidden features is task-dependent: TP beats RP for dialect geolocation (phone sequence patterns matter), while RP beats TP for dysarthria severity (fine acoustic cues matter).
  • Encoder-CTC stability: These models behaved like steady listeners across domains. AED’s pattern learning helped in some unseen cases but could normalize away tricky details.
  • Language diversity matters as much as data size: ZIPA trained on fewer hours but more languages did better on unseen languages than larger-but-less-diverse setups.
  • Whisper’s hidden features are very strong even when its phone transcripts aren’t top of the chart—proof that representations can carry a lot of phonetic gold.

Concrete Nuggets

  • ZIPA-CTC-NS delivered among the best intrinsic PFER across settings and solid extrinsic scores.
  • POWSM-CTC achieved top precision in zero-shot phone inventory induction, suggesting encoder-only CTC helps control the phone set in new languages.
  • LALMs struggled with unbiased phonetic discrimination and often collapsed to popular locations or accent clusters.

05 Discussion & Limitations

Limitations

  • Dataset coverage is incomplete: some languages, dialects, tones, and speaking styles are still underrepresented. Results may reflect dataset biases.
  • Phonetic transcription is not a single ground truth: choices of inventory and annotator style matter, and IPA can miss gradient phenomena.
  • Probes can bias outcomes: TP may overfit superficial cues like length; RP depends on which layer and what pooling/fusion strategy is used.
  • Default decoding and prompts: With more tuning, some models (including LALMs) might improve on specific tasks beyond what PRiSM reports.

Required Resources

  • Training and probing need GPUs; RP runs can take up to a few hours per task per model. Some models (like POWSM-CTC) need multi-node training.
  • Access to pretrained checkpoints and, for some corpora, licensed data is required.

When NOT to Use

  • If you need perfect word-level transcripts only, without phonetic detail, a pure ASR benchmark may be more relevant.
  • If your task depends primarily on prosody or intonation beyond segmental phones (e.g., tone languages with rich tonal contours), PR-only views may be insufficient.
  • If you cannot handle IPA or phone-set mapping in your pipeline, transcript-based PR evaluation may not fit easily.

Open Questions

  • Better pooling for RP: Would mid-layer fusion or multi-head temporal pooling close TP–RP gaps on tasks like geolocation?
  • Tones, prosody, and suprasegmentals: How do we integrate richer non-segmental features into PR benchmarks?
  • Fairness and bias: How can we proactively balance language and dialect coverage to reduce mode collapse and accent bias, especially for LALMs?
  • Unified decoding: Can we design decoding strategies that adapt per task, balancing acoustic faithfulness and learned phonotactics?
  • Representation interpretability: What tools best explain which phonetic cues RP actually uses in each task?

06 Conclusion & Future Work

3-Sentence Summary

PRiSM is a two-part report card for phone recognition: a fair, feature-aware transcription score (intrinsic) plus real-world task tests using transcripts and hidden features (extrinsic). Across many datasets and models, encoder-CTC architectures with broad multilingual exposure are the most stable, and specialized PR systems still beat large audio-language models at fine-grained phonetic perception. The benchmark shows why both transcripts and hidden features matter: different tasks prefer different views.

Main Achievement

PRiSM standardizes how we measure phone realization by combining a feature-aware error metric with a broad, open suite of downstream probes—revealing strengths and blind spots that single-metric evaluations miss.

Future Directions

  • Expand coverage to more languages, dialects, tones, and speaking styles; add prosody- and tone-aware evaluations.
  • Improve RP pooling and multi-layer fusion; explore adaptive decoding for better acoustic–phonotactic balance.
  • Develop fairness diagnostics to detect and reduce geographic and accent mode collapse, especially in LALMs.

Why Remember This

Because how well a model “hears” phones affects clinical care, language learning, and multilingual access to technology—and PRiSM finally gives us a clear, fair, and practical way to measure and improve that hearing.

Practical Applications

  • Screening tools that estimate dysarthria severity from everyday recordings to support clinicians.
  • Classroom or app-based pronunciation tutors that highlight which phones learners mispronounce.
  • Accent- or dialect-aware ASR that adapts to local phone patterns for better captions and transcripts.
  • Bootstrapping phone inventories for new or low-resource languages to aid language documentation.
  • Geolocation of dialects for sociolinguistic studies and heritage research using phone sequence patterns.
  • Quality control for speech datasets by detecting inconsistent or unlikely phone sequences.
  • Model selection guides for practitioners choosing between encoder-CTC vs. AED vs. LALMs for phonetic tasks.
  • Bias audits that reveal geographic or accent mode collapse in large audio-language models.
  • Adaptive decoding strategies that balance acoustic faithfulness and learned phonotactics per task.
  • Curriculum design for multilingual pretraining to maximize generalization to unseen languages.
#phone recognition · #phonetic transcription · #PFER · #phonotactics · #encoder-CTC · #AED · #LALM · #representation probe · #transcript probe · #multilingual speech · #language identification · #dialect geolocation · #dysarthria assessment · #L2 pronunciation · #phone inventory induction