Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Key Summary
- This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.
- It compares three ways to translate speech: classic cascades (ASR then MT/LLM), direct speech-to-text models, and new Speech-LLMs that let the LLM "hear".
- Across 16 benchmarks, 13 language pairs, and 9 tough conditions (like noise, accents, stutters, and long-form audio), cascaded systems are still the most reliable overall.
- Standalone speech foundation models (SFMs) without an LLM fall behind in quality, showing that strong language modeling is crucial.
- Top Speech-LLMs can match cascades in some settings (especially noisy audio, code-switching, and disfluencies), with Voxtral being the strongest Speech-LLM in this study.
- Accent robustness mostly depends on the speech encoder; Seamless is often best here, while many models wobble across dialects.
- Under babble noise, cascades can hallucinate more because the LLM never "hears" the audio; Speech-LLMs are surprisingly sturdier in noise.
- Cascades (and Voxtral) handle long talks best, showing tiny performance drops when audio is very long; some Speech-LLMs collapse on length.
- Human evaluations agree with quality-estimation metrics at the system level, confirming the test suite gives trustworthy comparisons.
- Bottom line: if you need steady, high-quality speech translation today, use a good cascade; if you need noise toughness or mixed-language handling, a top Speech-LLM can shine.
Why This Research Matters
Real conversations are messy: people have accents, speak in noisy places, and talk for a long time. This study shows which speech translation designs hold up under these real-world pressures. If you need steady, top-quality translations now, cascades are your safest bet. If your environment is noisy or speakers mix languages mid-sentence, a leading Speech-LLM can be a better fit. Knowing these trade-offs helps schools, hospitals, call centers, and travel apps choose the right tool. The shared test suite also pushes the field to fix weak spots like bias, name/term accuracy, and long-form memory.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're listening to a friend tell a story in Spanish, and you want to repeat it perfectly in English. You could write down every Spanish word first (like making a script) and then translate it, or you could listen and translate on the fly.
The Concept (Large Language Models, LLMs): LLMs are computer programs that read and write language very well. How it works: 1) They learn patterns from lots of text, 2) They predict the next word, 3) They follow instructions to stay on-topic, 4) They can translate, summarize, and explain. Why it matters: Without LLMs, translations can be stiff, miss context, and fail to follow user style. Anchor: When you ask an LLM to translate a paragraph and keep a polite tone, it knows how to do both.
Hook: You know how you can recognize your friend's voice even in a noisy cafeteria? Computers need a "hearing system" to do that too.
The Concept (Basic Speech Recognition, ASR): ASR turns spoken words into text. How it works: 1) It listens to audio waves, 2) It turns them into features, 3) It guesses letters/words, 4) It fixes them with language rules. Why it matters: Without ASR, you can't reliably turn speech into text to translate. Anchor: Voice typing on your phone uses ASR before any translation happens.
Hook: Building a house starts with a strong foundation so the whole structure is steady.
The Concept (Speech Foundation Models, SFMs): SFMs are powerful, general-purpose models that "hear" speech well. How it works: 1) They're trained on huge amounts of audio, 2) They learn accents and pronunciations, 3) They provide rich audio features to other systems. Why it matters: Without SFMs, systems struggle with accents, noise, and varied speech. Anchor: Whisper and Seamless are SFMs many projects use as their listening base.
Hook: Think of a relay race: the first runner listens, the second runner translates.
The Concept (Cascaded Speech Translation): A cascade first does ASR, then does translation with MT or an LLM. How it works: 1) Audio → ASR text, 2) Text → translation by an LLM or MT model, 3) Output polished text. Why it matters: Without this split, it's harder to use the best specialist at each step and to fix errors at the right place. Anchor: Many commercial apps use a cascade: Whisper for ASR, then a strong LLM like Tower+ or Aya to translate.
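To make the relay-race picture concrete, here is a minimal sketch of a two-step cascade using Hugging Face pipelines. The model choices, prompt wording, and decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal two-step cascade sketch: ASR first, then text translation with an LLM.
# Model names, prompt wording, and settings are illustrative assumptions.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="Unbabel/TowerInstruct-7B-v0.2")

def cascade_translate(audio_path: str, src: str = "Spanish", tgt: str = "English") -> str:
    # Step 1: speech -> source-language transcript.
    transcript = asr(audio_path)["text"]
    # Step 2: transcript -> translation, handled by a text LLM.
    prompt = (f"Translate the following {src} text into {tgt}. "
              f"Output only the translation.\n{transcript}")
    result = llm(prompt, max_new_tokens=256, return_full_text=False)
    return result[0]["generated_text"].strip()

# Usage: point at a local audio file (the path below is a placeholder).
# print(cascade_translate("talk_es.wav"))
```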
Hook: Imagine messaging a friend who speaks another language and instantly replying in their language without writing a transcript.
The Concept (Direct Speech Translation): Direct models go from speech straight to translated text in one shot. How it works: 1) Audio in, 2) Internal features, 3) Translation out, with no visible transcript. Why it matters: Without this, you may add delay and lose prosody info; but direct models need lots of paired speech-translation data. Anchor: A direct model might translate a live talk sentence-by-sentence without ever showing the original text.
Hook: Picture giving the translator super-ears so it can listen and understand at once.
The Concept (Speech-LLMs): Speech-LLMs plug an audio "ear" into an LLM, letting it listen and translate directly. How it works: 1) An SFM encodes audio, 2) An adapter maps audio features to the LLM's language space, 3) The LLM reasons and writes the translation. Why it matters: Without integrating speech and language closely, you lose LLM reasoning on what was actually heard. Anchor: Voxtral is a Speech-LLM that often rivals cascades in this paper.
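A toy sketch of the "ear plus brain" wiring: an SFM produces frame-level audio features, a small adapter projects (and downsamples) them into the LLM's embedding space, and the LLM then decodes the translation from the combined audio and prompt embeddings. The dimensions, downsampling factor, and layer choices below are assumptions for illustration, not any particular Speech-LLM's real architecture.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps audio-encoder features into the LLM's token-embedding space.
    Dimensions and downsampling factor are illustrative assumptions."""
    def __init__(self, audio_dim=1280, llm_dim=4096, downsample=4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from a speech foundation model.
        b, t, d = audio_feats.shape
        t = t - (t % self.downsample)                # drop leftover frames
        stacked = audio_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)                    # (batch, t // downsample, llm_dim)

# The adapted audio embeddings are then prepended to the embedded text prompt
# ("Translate the speech to English:") and fed to the LLM as one sequence.
adapter = SpeechAdapter()
fake_audio_feats = torch.randn(1, 1500, 1280)        # e.g., 30 s of Whisper-style frames
audio_tokens = adapter(fake_audio_feats)
print(audio_tokens.shape)                             # torch.Size([1, 375, 4096])
```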
The world before: Text LLMs became awesome at translation and following instructions, but speech translation still leaned on cascades: ASR first, then MT/LLM. Direct models existed but were hungry for data and not flexible. Researchers wondered: if LLMs are so good with text, what if they could truly "hear" too?
The problem: Do Speech-LLMs actually beat strong cascades when real speech is messy (noisy rooms, accents, long talks, stutters, names, mixed languages, and emotions)? Past studies rarely checked all these tough cases together.
Failed attempts: Direct models looked promising but often collapsed when data was limited or speech got tricky. Plain SFMs without an LLM lacked high-level language skill, producing unpolished or off-target results.
The gap: No comprehensive, apples-to-apples test across many languages and realistic conditions to judge whether adding ears to LLMs truly helps.
Real stakes: This matters for video calls, classrooms, hospitals, travel, and emergencies. We need systems that don't break when a doctor speaks fast, when a teacher has an accent, or when a noisy crowd is in the background. This paper builds that big, careful test so everyone can see what really works now, and what needs fixing next.
02 Core Idea
Hook: You know how a science fair needs fair rules, the same judges, and many different challenges to truly find the best project?
The Concept (The Aha!): Build a rigorous test suite that puts Speech-LLMs, cascades, and SFMs through the same tough, real-life challenges to see which one translates speech best. How it works: 1) Gather many benchmarks across languages, 2) Include 9 difficult conditions (noise, accents, stutters, etc.), 3) Use strict, reference-light quality checks, 4) Compare 21 systems fairly. Why it matters: Without a fair, broad test, we might believe hype or cherry-picked demos and choose the wrong tool for real users. Anchor: Hearing to Translate is like the Olympics for speech translation models, with events for noise, long talks, and code-switching.
Multiple analogies:
- Sports league: Cascades are the seasoned team with strong defense (ASR) and offense (LLM). Speech-LLMs are the exciting new team where one superstar tries to play both roles at once.
- Cooking: Cascades are a two-chef kitchen: one chef preps the ingredients (ASR), the other finishes the dish (LLM). Speech-LLMs are the single master chef doing everything from chopping to plating.
- School exams: The suite isn't just one quiz; it's math, science, language arts, and PE. A model must do well across many subjects (phenomena) to be called top of the class.
Before vs. After:
- Before: People guessed that letting LLMs "hear" might automatically be better than splitting the job (cascade). Evidence was scattered.
- After: Cascades are still the most reliable overall. However, top Speech-LLMs can tie or win in certain events (noise, code-switching, disfluencies). SFMs alone lag, proving the LLM component (in or outside the model) is crucial.
Hook: Imagine grading essays without answer keys by checking clarity, correctness, and language use.
The Concept (Quality Estimation, QE): QE scores rate translations without needing perfect reference answers. How it works: 1) A learned evaluator reads the source and the translation, 2) It predicts a quality score, 3) It penalizes wrong-language outputs strongly, 4) It summarizes system performance. Why it matters: Without QE, many speech tests would be impossible because we lack references for every case. Anchor: xCOMET QE S and METRICX QE S are the teachers grading all the systems' homework here.
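A tiny sketch of that grading logic: score each translation with a learned QE model, but slam outputs that land in the wrong language. The helpers detect_language and qe_model_score are stand-ins (assumptions) for a real language-ID tool and a real QE metric such as the xCOMET QE or METRICX QE evaluators used here.

```python
# Sketch of reference-free QE scoring with a hard off-target penalty.
# `detect_language` and `qe_model_score` are placeholder assumptions.

OFF_TARGET_SCORE = 0.0   # worst possible score on a 0-1 quality scale

def detect_language(text: str) -> str:
    # Placeholder: in practice, use a language-ID model (fastText, langid, ...).
    return "en" if all(ord(c) < 128 for c in text) else "xx"

def qe_model_score(source: str, hypothesis: str) -> float:
    # Placeholder for a learned quality-estimation model's prediction in [0, 1].
    return 0.85

def score_system(samples, target_lang="en"):
    scores = []
    for src, hyp in samples:
        if detect_language(hyp) != target_lang:
            scores.append(OFF_TARGET_SCORE)      # wrong-language output: maximum penalty
        else:
            scores.append(qe_model_score(src, hyp))
    return sum(scores) / len(scores)

samples = [("Hola, ¿cómo estás?", "Hello, how are you?"),
           ("Buenos días", "Buenos días")]        # second output stays in Spanish
print(score_system(samples))                       # the off-target output drags the average down
```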
Why it works (intuition):
- Breadth beats bias: Testing many languages and conditions prevents overfitting to easy sets.
- Strictness prevents cheating: Off-target penalties stop models from scoring high with the wrong language.
- Apples-to-apples design: Same prompts, default settings, realistic setups, so differences reflect real ability, not trickery.
Building blocks of the suite:
- Models: 5 Speech-LLMs (like Voxtral), 4 SFMs (like Whisper, Seamless, Canary, OWSM), and strong cascades built by pairing SFMs with LLMs (Aya, Gemma3, Tower+).
- Benchmarks: 16 datasets; 13 language pairs; conditions such as noise, accents, emotion, long-form, names, stutters, and code-switching.
- Metrics: QE scores plus special checks (like named-entity accuracy and performance gaps for gender, accent, noise, and length).
Bottom line: The paper's key contribution is not a new model, but a careful playground where everyone can finally see which approach holds up across the messy, real sounds of human speech.
03 Methodology
At a high level: Audio input → (choose system type: SFM alone, cascade SFM+LLM, or Speech-LLM) → translate under shared prompts and default settings → score with strict quality estimation and special tests → compare across 16 benchmarks, 13 language pairs, and 9 conditions.
Step A: Pick the systems fairly
- What happens: The authors select 21 systems under 32B parameters for broad accessibility: 4 SFMs (Whisper, Seamless, Canary, OWSM), 5 Speech-LLMs (Phi-4-Multimodal, Qwen2-Audio, DeSTA2, Voxtral, Spire), and cascades built by combining SFMs with 3 LLMs (Aya 32B, Gemma3 12B, Tower+ 9B).
- Why it exists: Without a diverse lineup, results could be biased toward one brand or size.
- Example: Whisper+Aya vs Seamless+Tower+: same idea (cascade) with different parts, testing how choices affect quality.
Step B: Standardize prompts and decoding
- What happens: For LLMs, they adapt the official WMT translation prompt; for Speech-LLMs, they tweak it to mention "speech"; for SFMs, they set the target (and sometimes source) language since these models aren't promptable.
- Why it exists: Prompts can tilt results. Using consistent instructions keeps it fair.
- Example: All LLM-based systems are told: "Translate X to Y, use precise terminology, output only the translation."
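For concreteness, here is a rough sketch of how one shared instruction might be built for the text LLMs versus the Speech-LLMs; it paraphrases the idea of a single consistent prompt and is not the exact WMT-derived template used in the paper.

```python
from typing import Optional

def build_prompt(src_lang: str, tgt_lang: str, transcript: Optional[str] = None) -> str:
    """Paraphrased sketch of a shared translation instruction (not the exact template)."""
    if transcript is not None:
        # Cascade / text LLM: the ASR transcript goes inside the prompt.
        return (f"Translate the following {src_lang} text into {tgt_lang}, "
                f"using precise terminology. Output only the translation.\n{transcript}")
    # Speech-LLM: the same instruction, but it refers to the attached speech.
    return (f"Translate the {src_lang} speech into {tgt_lang}, "
            f"using precise terminology. Output only the translation.")

print(build_prompt("German", "English", "Guten Morgen, wie geht es Ihnen?"))
print(build_prompt("German", "English"))
```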
Hook: You know how teachers test many skills (reading, writing, speaking) to get your true level?
The Concept (The 9 phenomena): The suite checks conditions that break real systems. How it works: 1) Each dataset targets a challenge (like accents or noise), 2) The suite measures gaps between easy and hard subsets, 3) This shows stability. Why it matters: Without targeted tests, a model might look great on clean speech but fail in the real world. Anchor: CommonAccent for accents, NoisyFLEURS for noise, WinoST for gender bias, LibriStutter for disfluencies, etc.
Step C: Curate the benchmarks
- What happens: 16 datasets cover clean speech (FLEURS, CoVoST2), gender bias (WinoST, FLEURS gender splits), accents (CommonAccent, ManDi), code-switching (CS-Dialogue, CS-FLEURS), disfluencies (LibriStutter), names/terms (NEuRoparlST), noise (NoisyFLEURS, newly made), emotion (EmotionTalk, mExpresso), and long-form (ACL 60/60, MCIF).
- Why it exists: Each benchmark focuses on a known weak spot.
- Example: NoisyFLEURS adds babble and ambient noise to clean audio to test true robustness.
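The core recipe behind NoisyFLEURS-style corruption is mixing clean speech with babble or ambient noise at a chosen signal-to-noise ratio. The function below is a generic illustration of that idea (with toy signals and a made-up SNR), not the dataset's actual construction script.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into clean speech at a target SNR (generic recipe, not the
    exact NoisyFLEURS construction)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage with synthetic signals (replace with real FLEURS audio and babble noise).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1 s of "speech" at 16 kHz
babble = rng.normal(size=8000)                                 # shorter noise clip
noisy = mix_at_snr(clean, babble, snr_db=5.0)
```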
Hook: Imagine grading an essay without an answer sheet by checking if it's clear, correct, and on-topic.
The Concept (QE scoring with penalties): QE models predict translation quality without references, and off-target language gets the maximum penalty. How it works: 1) The evaluator reads source and output, 2) If the output is in the wrong language, the score is slammed, 3) Otherwise, quality features decide the grade. Why it matters: Without off-target penalties, a model could cheat by answering in the wrong language and still score okay. Anchor: If asked for English but the model replies in Spanish, it gets a big red X.
Step D: Define special metrics and gaps
- What happens: Compute gaps for gender speaker performance, gender coreference (WinoST F1), accent (standard vs non-standard), disfluency (fluent vs stuttered), noise (clean vs noisy), and length (short vs long-form), as sketched after this list. Named entity and terminology accuracy are exact-match percentages.
- Why it exists: A single average score hides where models stumble.
- Example: A model may score high overall but have a big disfluency gap, meaning it fails with stutters.
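A small sketch of the two kinds of targeted metrics from Step D: a condition gap (average score on the easy subset minus the hard subset) and exact-match accuracy for names and terms. The numbers are toy values; in the suite, the subset scores come from the QE metrics.

```python
def condition_gap(easy_scores: list, hard_scores: list) -> float:
    """Gap = mean score on the easy subset minus mean score on the hard subset.
    A small gap means the system stays stable under the harder condition."""
    return sum(easy_scores) / len(easy_scores) - sum(hard_scores) / len(hard_scores)

def exact_match_accuracy(references: list, hypotheses: list) -> float:
    """Share of reference names/terms reproduced exactly in the output."""
    hits = sum(ref in hyp for ref, hyp in zip(references, hypotheses))
    return hits / len(references)

# Toy numbers: clean vs noisy QE scores, plus name accuracy on two sentences.
print(condition_gap([0.82, 0.79, 0.85], [0.70, 0.66, 0.73]))          # ~0.12 noise gap
print(exact_match_accuracy(["Ursula von der Leyen", "Strasbourg"],
                           ["Ursula von der Leyen spoke today.", "in Strasburg"]))  # 0.5
```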
Hook: Think of real-life talking: people pause, say "um," or repeat themselves when nervous.
The Concept (Disfluencies): These are the little bumps in speech like "um" or repeats. How it works: 1) Systems hear non-words, 2) They must ignore or handle them, 3) They still produce smooth, correct translations. Why it matters: Without robustness to disfluencies, live captions and translations become messy and wrong. Anchor: LibriStutter tests whether models keep meaning despite stutters.
Hook: Imagine a friend who flips between English and Spanish mid-sentence.
The Concept (Code-switching): Speakers mix languages fluidly. How it works: 1) Model detects language changes, 2) Keeps track of who's who and what's what, 3) Produces one coherent translation. Why it matters: Without code-switching skill, translations lose names, terms, or switch targets incorrectly. Anchor: CS-Dialogue checks Mandarin-English mixing in real conversations.
Hook: You know how you'd double-check a person's name in a story so you don't mix them up?
The Concept (Named entities and terminology): Keeping names and key terms exact. How it works: 1) Detect important words (people, places, orgs), 2) Translate or copy them accurately, 3) Match references exactly. Why it matters: Without this, medical, legal, and academic talks become risky. Anchor: NEuRoparlST grades how often names/terms are exactly right.
Hook: Picture trying to talk to a friend at a loud birthday party.
The Concept (Noise robustness): Models must hear through background sounds. How it works: 1) Recognize speech amid babble or ambient noise, 2) Avoid hallucinating words, 3) Preserve meaning. Why it matters: Without noise toughness, real calls, streets, and classrooms break the system. Anchor: NoisyFLEURS adds babble and ambient noise to test survival.
Hook: Think of how "water" sounds different in New York vs London vs Sydney.
The Concept (Accent variation): People pronounce words differently by region. How it works: 1) Encoders learn varied pronunciations, 2) Models stay steady across accents/dialects, 3) Don't overfit to the "standard". Why it matters: Without accent stability, many users are unfairly served. Anchor: CommonAccent and ManDi measure wobble across accents/dialects.
Hook: When you hear "she" or "he," you use context to know who it refers to.
The Concept (Gender bias in translation): Models may favor stereotypes or one gender form. How it works: 1) Check male vs female speaker performance, 2) Test stereotypical vs anti-stereotypical roles, 3) See if pronouns and roles match context instead of biases. Why it matters: Without fairness, systems misgender and misrepresent people. Anchor: WinoST tracks whether models choose gendered words by clue or by stereotype.
Hook: Reading a whole book is harder than a single page.
The Concept (Long-form handling): Keeping track over minutes of speech. How it works: 1) Chunk audio smartly, 2) Keep global context, 3) Avoid early mistakes snowballing. Why it matters: Without long-form strength, lectures and meetings fall apart. Anchor: ACL 60/60 and MCIF test models on extended talks.
Step E: Run with realistic defaults
- What happens: Use standard decoding and default settings. Long-form processing follows each toolkitās normal practice.
- Why it exists: Fancy tuning can hide weaknesses; defaults show out-of-the-box reliability.
- Example: Whisper uses its standard chunked long-form pipeline; Seamless lacks a standard long-form path and is tested accordingly.
Step F: Score and analyze
- What happens: Compute QE scores (xCOMET QE S, METRICX QE S), apply off-target penalties, compute gaps and exact-match metrics for names/terms, and compare across languages and conditions.
- Why it exists: Numbers plus targeted analyses reveal which design choices (encoder, LLM, integration) matter for each challenge.
- Example: Accent robustness often traces back to the speech encoder, while gender bias disparity often comes from the LLM decoder.
Secret sauce of the method:
- Wide coverage (many datasets and languages) avoids one-trick-pony wins.
- Strict off-target penalties stop wrong-language "good-looking" mistakes.
- Gap metrics shine light on hidden brittleness.
- Human validation confirms the QE metrics' rankings are trustworthy at the system level.
04 Experiments & Results
The test: Measure translation quality and robustness across 16 benchmarks, 13 language pairs, and 9 conditions using strict QE metrics and targeted tests (gaps, name/term accuracy). Why: To know which system you can trust when speech is messy, long, or biased.
The competition: 21 systems (4 SFMs, 5 Speech-LLMs, and 12 cascades) facing the same tasks, prompts, and defaults.
The scoreboard with context:
- Generic clean speech (like a standard classroom test): Cascades usually scored at the top (think of them as getting A to A+) because the LLM translator cleans up the ASR transcript very well. Voxtral, the strongest Speech-LLM here, often closed the gap and sometimes beat cascades, showing that tight audio-language integration can pay off.
- Gender bias: Most systems had small male-female performance gaps on FLEURS, but WinoST exposed bigger issues: some systems did worse on anti-stereotypical roles (e.g., "female engineer"). Cascades with Tower+ were most balanced, hinting that the LLM decoder largely drives the fairness behavior.
- Accents: Seamless, used directly or in cascades, was especially strong on CommonAccent. Many Speech-LLMs wobbled a lot across accents. In ManDi (Chinese dialects), cascades tended to favor standard Mandarin, while some SFMs and Speech-LLMs were less biased. Accent strength mainly came from the speech encoder choice.
- Code-switching: Voxtral and cascaded Whisper led. Whisper alone lagged them, proving both the encoder and the (LLM) decoder matter for mixed-language talks.
- Disfluencies (stutters/ums): Voxtral, DeSTA2, and Whisper cascades were most robust. Some systems (Seamless, OWSM, Phi-4-Multimodal) suffered big drops. Even among Whisper-based Speech-LLMs, integration style made a big difference, so robustness wasnāt just about the encoder.
- Named entities and terminology: Cascades with Tower+ (and Canary overall) led on exact matches. High overall QE scores didn't always mean great name/term accuracy, so the specialized metrics were essential.
- Noise: Under babble and ambient noise, everyone got worse, but Speech-LLMs (except Spire) were often sturdier than cascades and SFMs. Why? In cascades, ASR hallucinations can mislead the LLM, which never hears the audio itself; Speech-LLMs can keep listening and avoid amplifying ASR mistakes.
- Emotion: Contrary to past expectations, current direct and Speech-LLM systems weren't better at emotional speech; cascades (and Voxtral) stayed more stable.
- Long-form: DeSTA2 and Qwen2-Audio collapsed on very long inputs (huge length gaps). Cascades with OWSM or Canary, and Voxtral, stayed nearly flat (tiny gaps), showing excellent long-talk reliability. Voxtral's trick: chunk audio for the encoder but rejoin representations so the LLM truly sees the whole context.
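A simplified sketch of that chunk-then-rejoin idea: encode fixed-length audio windows separately, then concatenate their representations so the language model still sees the whole talk as one sequence. The window length, encoder interface, and tensor shapes are illustrative assumptions, not Voxtral's actual implementation.

```python
import torch

def encode_long_audio(waveform: torch.Tensor, encoder, chunk_samples: int = 30 * 16000) -> torch.Tensor:
    """Encode a long waveform in fixed windows and rejoin the representations
    so the downstream LLM sees the whole talk as one sequence (illustrative only)."""
    chunks = torch.split(waveform, chunk_samples)                 # e.g., 30-second windows at 16 kHz
    encoded = [encoder(chunk.unsqueeze(0)) for chunk in chunks]   # each: (1, frames_i, dim)
    return torch.cat(encoded, dim=1)                              # (1, total_frames, dim)

# Toy encoder standing in for a speech foundation model's feature extractor.
class ToyEncoder(torch.nn.Module):
    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav.shape[-1] // 320                             # crude 20 ms frame rate
        return torch.zeros(wav.shape[0], frames, 1280)

two_minutes = torch.randn(2 * 60 * 16000)
features = encode_long_audio(two_minutes, ToyEncoder())
print(features.shape)                                             # torch.Size([1, 6000, 1280])
```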
Surprising findings:
- Noise is a secret superpower zone for Speech-LLMs: by letting the LLM "hear", they avoid cascading ASR hallucinations.
- Long-form remains hard for many Speech-LLMs; architecture and memory planning matter a lot.
- High QE scores don't guarantee correct names/terms; targeted accuracy checks are needed.
- Human evaluation agreed with QE metrics at the system level (a correlation around 0.5), confirming the suite's overall judgments.
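What "agreement at the system level" means in practice: average the human scores and the QE scores per system, then correlate the two lists of system averages. The numbers below are made-up toy values, and Spearman rank correlation is just one common choice; the paper reports a moderate correlation of roughly 0.5.

```python
from scipy.stats import spearmanr

# Toy per-system averages (made-up numbers, not the paper's results).
human_avg = {"cascade_A": 4.2, "cascade_B": 4.0, "speech_llm_A": 3.9, "sfm_A": 3.1}
qe_avg    = {"cascade_A": 0.86, "cascade_B": 0.84, "speech_llm_A": 0.85, "sfm_A": 0.72}

systems = list(human_avg)
rho, p_value = spearmanr([human_avg[s] for s in systems],
                         [qe_avg[s] for s in systems])
print(f"system-level Spearman correlation: {rho:.2f} (p={p_value:.2f})")
```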
Overall takeaway in plain words: If you need consistent top-tier quality today, pick a good cascade. If your main enemy is noise or code-switching, a top Speech-LLM like Voxtral may serve you best. Don't use SFMs alone if translation quality really matters; bring in an LLM, whether inside a Speech-LLM or as the cascade's second step.
05 Discussion & Limitations
Limitations:
- The suite is big but not everything: it covers 9 key phenomena, yet other real-world quirks (like overlapping speakers or strong reverberation) could be added.
- Models change fast: newer versions might shift rankings; this is a snapshot in time.
- QE metrics, while validated, still miss fine-grained nuances; human checks help but are small-scale.
- Long-form handling depends on each toolkit's current best practice; different chunking or memory settings might alter results.
Required resources:
- To reproduce: public weights for SFMs, LLMs, Speech-LLMs; Hugging Face/ESPnet/NVIDIA NeMo toolchains; compute for inference across 16 datasets.
- To deploy: a cascade needs a solid ASR plus a good LLM. A Speech-LLM needs GPU memory sized for longer audio contexts.
When NOT to use which approach:
- Don't rely on SFMs alone if translation fidelity is critical; they trail cascades and Speech-LLMs in overall quality.
- Avoid Speech-LLMs with weak long-context handling for lectures and long meetings; use cascades or Voxtral-like designs.
- If your environment is very noisy, a cascade may hallucinate more; a strong Speech-LLM can be safer.
Open questions:
- Can Speech-LLMs close the long-form gap reliably without huge memory costs?
- How can we reduce gender and stereotype bias at the LLM decoder while preserving translation accuracy?
- What training or data augmentation best boosts accent robustness without hurting other skills?
- Can we design hybrid systems where the LLM both hears audio and reads ASR hypotheses, combining the strengths of both worlds?
- How do we make name/term accuracy a first-class training target so specialized domains get safer translations?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Hearing to Translate, a large, careful test suite that fairly compares cascades, Speech-LLMs, and SFMs across many languages and tough, real-world speech conditions. Results show that cascaded systems are still the most reliable overall, while top Speech-LLMs can match or beat them in certain cases like noise and code-switching; SFMs alone fall behind, highlighting the importance of strong language modeling. Human checks line up with automatic quality estimation, confirming the suite's rankings are trustworthy.
Main achievement: The paper sets a rigorous, shared yardstick for speech translation systems, revealing exactly when letting an LLM "hear" helps, and when a classic ASR+LLM pipeline is the best bet.
Future directions: Build Speech-LLMs with stronger long-context memory, reduce decoder-driven gender bias, harden accent robustness, and integrate names/terms accuracy directly into training. Explore hybrids where the LLM both listens to audio and reads ASR drafts to limit hallucinations in noise while keeping long-form stability.
Why remember this: Choosing the right speech translation approach depends on your battlefield: cascades for steady excellence, Speech-LLMs for noise and mixed-language agility, never SFMs alone if quality counts. As Speech-LLMs mature, this suite will keep the field honest, showing which designs truly help people understand each other, no matter how real-world the speech gets.
Practical Applications
- Choose cascaded ASR+LLM for enterprise meeting translations where long-form stability and overall quality are critical.
- Adopt a top Speech-LLM (like Voxtral) for noisy on-the-go translation apps that must resist background chatter.
- Use Seamless-based pipelines for accent-heavy environments to boost recognition of regional pronunciations.
- Deploy cascades with Tower+ when accurate names and terminology matter (e.g., legal or medical settings).
- Prefer Speech-LLMs for code-switched customer support calls to maintain coherence when languages mix.
- Audit gender bias using WinoST-style tests and select LLMs that minimize stereotype-driven errors.
- Evaluate models with no-reference QE scores and strict off-target penalties when references are scarce.
- For classroom lectures or conferences, use cascades or Voxtral-like designs that preserve long-form context.
- Train or fine-tune with accent-diverse data to reduce wobble across dialects before deployment.
- Add a terminology post-check step for critical domains to catch and correct name/term mismatches.