Quantifying Speaker Embedding–Phonological Rule Interactions in Accented Speech Synthesis
Key Summary
- The paper shows how to control accents in text-to-speech (TTS) by mixing simple, linguistics-based sound-change rules with speaker embeddings.
- Three big accent rules—flapping, rhoticity, and vowel swaps—are used to steer American ↔ British English pronunciations.
- A new score called phoneme shift rate (PSR) measures how much the speaker embedding preserves or overrides those rules.
- Rules plus embeddings make accents sound more authentic without hurting naturalness, based on an automatic naturalness rater (UTMOS).
- An accent classifier and accent-embedding similarity both confirm that rules push speech toward the target accent.
- PSR reveals that embeddings sometimes “pull back” pronunciations, showing accent and identity are entangled.
- Vowel correspondences help the most, rhoticity helps accent similarity a lot, and flapping adds smaller but useful gains.
- The method works across different voices: some voices already sound quite British and still benefit from rules; others rely more on rules to sound right.
- The study provides a clear, interpretable way to tune accents and a tool to study disentanglement in speech generation.
Why This Research Matters
Clear accent control helps people hear and be heard the way they expect, making tools like assistants and audiobooks feel more local and friendly. Teachers and learners can use targeted rules to practice specific accent features, like dropping r’s or changing key vowels. Media creators can localize content so characters sound right for their setting without re-recording actors. Companies can match customer accents for smoother support calls while keeping the same trusted voice identity. Researchers gain a transparent way to test if models keep accent separate from voice, improving fairness and customization. Accessibility tools can better fit users’ listening comfort and comprehension by adjusting accent strength. Overall, this approach makes speech technology more inclusive, controllable, and reliable.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how some friends say “wa-der” and others say “wa-tuh” for “water”? That’s because accents change how we pronounce sounds, even when we read the same sentence.
🥬 The Concept (Text-to-Speech, TTS): TTS is a computer system that turns written words into spoken voice.
- How it works: 1) Read the text, 2) Turn text into sound instructions (phonemes), 3) Use a voice model to speak them out loud.
- Why it matters: Without TTS, computers can’t talk to us in audiobooks, GPS, or voice assistants. 🍞 Anchor: When your map app says “Turn right,” that’s TTS making text into speech.
🍞 Hook: Imagine you’re the director telling an actor, “Do this line with a British accent.” You want the exact style you asked for.
🥬 The Concept (Accent Control): Accent control means steering TTS to sound like a certain accent on purpose.
- How it works: The model gets extra signals that push its pronunciation style toward, say, American or British.
- Why it matters: Without accent control, voices can sound randomly mixed or not match what users need. 🍞 Anchor: A cartoon character in London should not sound like they’re from California.
🍞 Hook: Think of a rule like “i before e except after c.” Language has sound rules too.
🥬 The Concept (Linguistic Knowledge): Linguistic knowledge is expert understanding of how sounds and words usually behave in real languages.
- How it works: Linguists document patterns (e.g., which sounds change in certain places) and turn them into rules.
- Why it matters: Without this, accent control is a guessing game instead of a guided process. 🍞 Anchor: Knowing British English often drops the “r” in “car” helps TTS say “cah,” not “car.”
🍞 Hook: When you teach a robot to talk, it learns patterns from lots of examples.
🥬 The Concept (Deep Learning): Deep learning is a way computers learn by finding patterns in big piles of examples.
- How it works: Feed recordings and transcripts into a model; it adjusts tiny knobs (weights) to sound right.
- Why it matters: Without deep learning, TTS wouldn’t reach today’s naturalness. 🍞 Anchor: A model learns that “th” in “the” sounds different from “th” in “thin” by hearing many examples.
🍞 Hook: If your ears can tell voices apart—high vs. low, smooth vs. scratchy—computers can, too.
🥬 The Concept (Speech Features): These are pieces of information in audio that describe pitch, loudness, and sound shapes.
- How it works: The model turns waveforms into numbers that capture voice qualities and pronunciations.
- Why it matters: Without features, the model can’t tell one voice, emotion, or accent from another. 🍞 Anchor: A high-pitched child’s voice vs. a low adult voice are different features the model can read.
🍞 Hook: Ever notice how “bath” can sound like “bæth” or “bahth” depending on where someone is from?
🥬 The Concept (Accent Variation): Accents are systematic ways speech differs across places or groups.
- How it works: Sounds shift in predictable spots (like certain vowels or “r” sounds).
- Why it matters: Without understanding variation, TTS can’t match the accent listeners expect. 🍞 Anchor: In many British accents, “car” loses the final r; in most American accents, it keeps it.
🍞 Hook: A voiceprint can tell who’s speaking; computers use something similar.
🥬 The Concept (Speaker Embeddings): A speaker embedding is a compact digital “voice fingerprint” that captures what a voice sounds like.
- How it works: A neural network turns a short audio sample into a vector that encodes timbre, style—and often accent.
- Why it matters: Without embeddings, TTS can’t clone or switch between specific voices. 🍞 Anchor: Picking “Voice A” vs. “Voice B” in a TTS app is picking different embeddings.
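To make the “voice fingerprint” idea concrete, here is a minimal sketch using the open-source resemblyzer package as a stand-in speaker encoder; it is an illustrative choice, not the encoder used in the paper, and `sample.wav` is a hypothetical input file.

```python
# Minimal sketch: extract a speaker embedding with the open-source
# resemblyzer package (a stand-in encoder, not the paper's).
from pathlib import Path

from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav(Path("sample.wav"))  # load and normalize a short clip
encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)  # numpy vector capturing timbre/style

print(embedding.shape)  # (256,) -- similar voices land near each other
```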
🍞 Hook: Like house rules for a board game, many accents follow common pronunciation rules.
🥬 The Concept (Phonological Rules): These are simple, linguistics-based instructions for how certain sounds change in context.
- How it works: Identify target sounds and positions (like “t” between vowels) and swap them for accent-appropriate sounds.
- Why it matters: Without rules, accent control is a blurry, all-at-once push; rules give precise, targeted steering. 🍞 Anchor: Turning American “wa-der” back into British “wa-ter” applies a “no flapping” rule.
🍞 Hook: Imagine measuring how much a magnet (the embedding) drags a compass needle away from where you point it (the rule).
🥬 The Concept (Phoneme Shift Rate, PSR): PSR measures how often the intended rule-based sound changes get pulled back or undone by the speaker embedding.
- How it works: 1) Count how many phoneme swaps the rules ask for, 2) After synthesis, re-check how many swaps still “need” doing, 3) PSR = remaining swaps / original swaps.
- Why it matters: Without PSR, we can’t tell whether rules or embeddings are in charge when they disagree. 🍞 Anchor: If you told the system to remove final “r” sounds 10 times but 4 “r”s sneak back, PSR = 4/10 = 0.4.
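As code, the definition is just a ratio; here is a minimal sketch with illustrative names:

```python
def phoneme_shift_rate(planned_swaps: int, remaining_swaps: int) -> float:
    """PSR = swaps still needed after synthesis / swaps the rules asked for."""
    if planned_swaps == 0:
        raise ValueError("PSR is undefined when no swaps were planned")
    return remaining_swaps / planned_swaps

# 10 final r's were supposed to be dropped, 4 "snuck back": PSR = 0.4
print(phoneme_shift_rate(planned_swaps=10, remaining_swaps=4))
```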
- The World Before: TTS could already sound impressively human thanks to deep learning and speaker embeddings. But asking for a specific accent was messy: speaker embeddings bundled accent with other traits (like voice color, emotion, even background noise).
- The Problem: We wanted more dials. Not just “pick a British-sounding voice,” but “switch these exact sounds that make British accents sound British,” while keeping the same speaker identity.
- Failed Attempts: Purely data-driven methods can produce nice accents but are hard to interpret or fine-tune. Add-ons like accent labels or transliteration help, but still feel like global nudges rather than precise edits.
- The Gap: We lacked a clear, linguistic “steering wheel,” and a clean ruler to measure the tug-of-war between rules and embeddings.
- This Paper’s Move: Use three big, well-known rules—flapping, rhoticity, and vowel correspondences—as clean levers, and invent PSR to measure who wins when rules and embeddings disagree.
- Real Stakes: Better accent control means more inclusive assistants, clearer education tools, localized media that sounds right, and fairer testing of how well models separate accent from identity.
02 Core Idea
🍞 Hook: Think of a dimmer switch and a color filter on a lamp. The dimmer (embedding) sets the vibe; the color filter (rule) sets the hue. You need both to get the scene just right.
🥬 The Concept (Main Innovation): The paper adds simple, linguistically grounded phonological rules to TTS and introduces PSR to measure how those rules interact with speaker embeddings.
- How it works: 1) Convert text to phonemes, 2) Apply accent rules (like unflap t, drop post-vocalic r, swap certain vowels), 3) Feed phonemes and a speaker embedding into TTS, 4) Measure outcomes with accent classifiers, embedding similarity, and PSR.
- Why it matters: Without explicit rules and PSR, accent control stays a black box and we can’t tell how much accent is entangled with speaker identity. 🍞 Anchor: Asking for British “water” becomes “wa-ter,” not “wa-der,” and PSR tells you if the voice embedding tried to sneak the flap back in.
Three Analogies for the Same Idea:
- Chef + Recipe: The chef (embedding) has a personal style; the recipe (rules) says “use British spices.” PSR checks whether the final dish really tastes British or if the chef’s habits dominated.
- GPS + Driver: The GPS (rules) sets the route to “British,” but the driver (embedding) sometimes takes familiar American shortcuts. PSR counts how many turns the driver ignored.
- Costume + Actor: The costume (rules) gives British looks (unflapped t, no final r), but the actor’s natural walk (embedding) still shows. PSR measures how often the walk peeks through the costume.
Before vs After:
- Before: Accent was mostly controlled by choosing a different speaker embedding. Changes were broad, entangled, and hard to tweak.
- After: You get a small set of clear, adjustable knobs (rules). You can boost or tone down specific accent traits while keeping the same speaker. PSR tells you if the knobs are actually working.
Why It Works (Intuition, no equations):
- Accents differ in a few loud, reliable places—like flapping, rhoticity, and key vowels. Hitting those spots carries lots of the accent “feel.”
- Speaker embeddings are powerful, so they sometimes bend outputs back toward their learned style. PSR quantifies that bending.
- Combining explicit rules with data-driven voices gives the best of both: control and naturalness.
Building Blocks (with Sandwich mini-explanations):
- 🍞 Hook: Ever turn “car” into “cah” by not pronouncing the last sound? 🥬 The Concept (Rhoticity): Rhoticity is whether you pronounce post-vocalic r’s (like in “car”).
- How it works: American keeps the r (car → “car”); many British accents drop or soften it (car → “cah”).
- Why it matters: It’s a strong accent cue; missing it confuses listeners about the accent. 🍞 Anchor: “Hard” sounding like “hahd” points to non-rhotic (British-like) pronunciation.
- 🍞 Hook: Say “city” fast—it can sound like “cidy.” 🥬 The Concept (Flapping): Flapping turns t between vowels into a quick tap (like a soft d) in American English.
- How it works: Intervocalic t in unstressed spots becomes [ɾ]. British typically keeps a crisp [t].
- Why it matters: It’s a familiar American hallmark; undoing it helps sound British. 🍞 Anchor: “Water” → American “wa-der,” British “wa-ter.”
- 🍞 Hook: Think of vowel pairs like team jerseys that switch colors across leagues. 🥬 The Concept (Vowel Correspondences): Certain vowel sets regularly differ across accents.
- How it works: Map American vowel choices to British ones in known lexical sets (like TRAP/BATH/GOAT).
- Why it matters: These swaps carry a big chunk of accent identity. 🍞 Anchor: “Bath” → American /bæθ/ vs. British /bɑːθ/.
- 🍞 Hook: If you push a spring and it bounces back, how strong was your push? 🥬 The Concept (PSR again, the meter): PSR is the bounce-back meter for rules.
- How it works: Compare intended rule changes to what still needs changing after synthesis.
- Why it matters: It reveals how much embeddings resist or accept the rules. 🍞 Anchor: If you planned 10 vowel swaps and only 6 stuck, PSR = 4/10 = 0.4 (some bounce-back).
03 Methodology
At a high level: Text → American phonemes → apply British rules → TTS with speaker embedding (fixed durations) → Evaluate (accent strength, PSR, naturalness).
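As a structural sketch, that flow looks like the snippet below. Every function in it is a trivial placeholder standing in for a component described in the steps that follow; none of this is the paper’s actual code.

```python
# Structural sketch of the pipeline; all components are placeholders.
def g2p_american(text):          return text.split()   # -> American phonemes
def apply_british_rules(ph):     return ph             # rule pass (see below)
def tts(ph, speaker_embedding):  return b""            # -> audio bytes
def accent_prob(audio):          return 0.0            # classifier probability
def accent_similarity(audio):    return 0.0            # cosine similarity
def compute_psr(ph, audio):      return 0.0            # PSR = N2 / N1
def utmos(audio):                return 0.0            # predicted naturalness

def run(text, speaker_embedding):
    phonemes_us = g2p_american(text)                 # text -> American phonemes
    phonemes_gb = apply_british_rules(phonemes_us)   # apply British rules
    audio = tts(phonemes_gb, speaker_embedding)      # durations held fixed
    return {
        "accent_prob": accent_prob(audio),
        "accent_similarity": accent_similarity(audio),
        "psr": compute_psr(phonemes_gb, audio),
        "utmos": utmos(audio),
    }

print(run("The water in the bath was warm.", speaker_embedding=None))
```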
Step-by-step (with Sandwich explanations for key pieces):
- Text to Phonemes (G2P)
- 🍞 Hook: Imagine turning a recipe’s words into cooking moves—“chop,” “stir,” “bake.”
- 🥬 The Concept (G2P: Grapheme-to-Phoneme): G2P converts letters into the sequence of speech sounds (phonemes).
- How it works: The Misaki G2P tool reads English text and outputs American English phonemes.
- Why it matters: TTS needs sounds, not just letters; letters can be tricky (think “rough” vs. “though”).
- 🍞 Anchor: “Water” → [w a ɾ ə ɹ] for American English.
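If you want to try G2P yourself, the open-source phonemizer package is an easy stand-in (the paper uses Misaki G2P); this sketch assumes the espeak-ng backend is installed, and the printed output is approximate.

```python
# G2P sketch with the phonemizer package as a stand-in for Misaki G2P.
# Requires the espeak-ng backend installed on the system.
from phonemizer import phonemize

ipa = phonemize("water", language="en-us", backend="espeak", strip=True)
print(ipa)  # roughly "wˈɔːɾɚ": flapped t [ɾ] and a rhotic ending
```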
- Apply Phonological Rules (American → British)
- 🍞 Hook: Before painting, you tape edges so the color change is clean.
- 🥬 The Concept (Rule Application): Systematically transform the American phoneme sequence using three rule groups.
- How it works: One-to-one substitutions that keep the same number of phoneme symbols and fixed durations.
- Why it matters: Keeping timing and count constant ensures differences come from sounds, not rhythm.
- 🍞 Anchor: “Water” [w a ɾ ə ɹ] → [w ɒ t ə] (unflap t, drop final r, adjust vowel), following the paper’s British mappings.
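A simplified sketch of such substitutions over a phoneme list follows; it is illustrative only, since the paper’s rule set and phoneme inventory are richer and real rules are more context-sensitive.

```python
# Simplified American -> British rule pass over a phoneme list.
# Illustrative only; the paper's mappings are more complete.
VOWELS = set("aeiouæɑɒɔəɜɪʊʌ")

def apply_british_rules(phonemes: list[str]) -> list[str]:
    out = list(phonemes)
    for i, p in enumerate(out):
        after_vowel = i > 0 and out[i - 1] in VOWELS
        before_vowel = i + 1 < len(out) and out[i + 1] in VOWELS
        if p == "ɾ" and after_vowel and before_vowel:
            out[i] = "t"    # unflap intervocalic t: "wa-der" -> "wa-ter"
        elif p == "ɹ" and after_vowel and not before_vowel:
            out[i] = ""     # non-rhotic: drop post-vocalic r (simplified)
        elif p == "æ":
            out[i] = "ɑː"   # BATH-set vowel: /bæθ/ -> /bɑːθ/
        elif p == "a":
            out[i] = "ɒ"    # illustrative vowel adjustment for "water"
    return [p for p in out if p]

print(apply_british_rules(["w", "a", "ɾ", "ə", "ɹ"]))  # ['w', 'ɒ', 't', 'ə']
```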
- Synthesize Speech with Kokoro TTS
- 🍞 Hook: Think of a music player that takes notes (phonemes) and plays them in a chosen instrument (voice embedding).
- 🥬 The Concept (Kokoro TTS + Speaker Embedding): Kokoro-82M takes phonemes, a chosen speaker embedding (voice), and fixed durations to generate audio.
- How it works: Feed either the American or British phoneme sequence plus an embedding (e.g., American or British voice presets). Durations are held constant.
- Why it matters: This isolates the effect of rules vs. embeddings; timing doesn’t muddy the test.
- 🍞 Anchor: Use the same sentence with either American or British phonemes, and switch between the “Fable” (British) and “afheart” (American) embeddings.
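For reference, synthesis with the open-source kokoro package looks roughly like this. The interface is assumed from the public README, and the paper’s exact setup (including how durations are held fixed) may differ.

```python
# Sketch: synthesis with the open-source kokoro package (assumed
# interface; the paper's fixed-duration setup may differ).
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English

text = "The water in the bath was warm."
for voice in ("af_heart", "bm_fable"):  # American vs. British voice preset
    for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice=voice)):
        sf.write(f"{voice}_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz
```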
- Measure Accent Strength (Two Ways)
- 🍞 Hook: Like judging a soccer match by both the score and the highlight reel.
- 🥬 The Concept (Accent Classifier Probabilities): Vox-Profile’s accent classifier outputs probabilities for North American, British Isles, or Other accents.
- How it works: Run the synthesized audio through the classifier; read the probability for the target accent.
- Why it matters: Without a score, we can’t tell if the accent change “took.”
- 🍞 Anchor: British probability rises when rules push speech toward British features.
- 🍞 Hook: If two photos look alike, their features match closely.
- 🥬 The Concept (Accent Embedding Similarity): Compare the synthesized audio’s accent embedding to a real-speaker group reference using cosine similarity.
- How it works: Higher similarity to the British reference means the audio clusters with British accents.
- Why it matters: This checks accent style beyond just the classifier’s labels.
- 🍞 Anchor: Similarity jumps from 0.67 to 0.85 with all British rules applied to a British voice embedding.
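The similarity measure itself is one line of math; here is a minimal sketch with random placeholder vectors standing in for real accent embeddings.

```python
# Cosine similarity between a synthesized clip's accent embedding and a
# reference cluster mean. Vectors here are random placeholders.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
synth = rng.normal(size=192)        # accent embedding of synthesized audio
british_ref = rng.normal(size=192)  # mean embedding of real British speech

print(cosine_similarity(synth, british_ref))  # in [-1, 1]; higher = closer
```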
- Compute PSR (Phoneme Shift Rate)
- 🍞 Hook: Did our edits stick, or did some get undone?
- 🥬 The Concept (PSR Procedure): Count original planned substitutions (N1). Then use a phoneme recognizer (Wav2Vec2Phoneme) on the synthesized audio and count how many substitutions still need doing (N2). PSR = N2/N1.
- How it works: If PSR is low, rules stuck; if high, the embedding pulled outputs back.
- Why it matters: It directly measures rule–embedding interaction at the phoneme level.
- 🍞 Anchor: If you tried 100 British vowel swaps and 31 didn’t stick, PSR = 0.31.
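A sketch of the recognition half, using the public Wav2Vec2Phoneme checkpoint on Hugging Face (assumed to match the paper’s recognizer); the alignment step that counts remaining swaps (N2) is application-specific and only stubbed here.

```python
# Sketch: recognize phonemes from synthesized audio, then PSR = N2/N1.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

MODEL = "facebook/wav2vec2-lv-60-espeak-cv-ft"  # public phoneme CTC model
processor = AutoProcessor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL)

def recognize_phonemes(waveform_16k: np.ndarray) -> str:
    inputs = processor(waveform_16k, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]  # space-separated phonemes

def psr(n_planned: int, n_remaining: int) -> float:
    return n_remaining / n_planned  # PSR = N2 / N1
```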
- Check Naturalness (UTMOS)
- 🍞 Hook: After a makeover, you still want the person to look natural, not like a mannequin.
- 🥬 The Concept (UTMOS): An automatic tool that predicts human Mean Opinion Scores (1–5) for how natural speech sounds.
- How it works: Feed in audio; get a score. Higher is better.
- Why it matters: If rules make speech robotic, that’s not useful.
- 🍞 Anchor: Scores stay around 4.4 for North American setups and 3.7 for British ones, showing rules don’t hurt naturalness.
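One openly available UTMOS predictor can be loaded through torch.hub, as sketched below; this is an assumed stand-in reimplementation and may not be the paper’s exact checkpoint.

```python
# Sketch: naturalness scoring with an open UTMOS reimplementation via
# torch.hub (assumed stand-in for the paper's exact rater).
import torch

predictor = torch.hub.load(
    "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
)

wave = torch.rand(1, 16_000)     # placeholder 1-second clip at 16 kHz
score = predictor(wave, 16_000)  # predicted MOS on a 1-5 scale
print(float(score))              # higher = more natural
```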
Secret Sauce: Keep everything but the phonemes fixed (same durations, same text) so differences come from exactly two sources—rules and embeddings. Then, use PSR to measure how they tug against each other. This careful setup makes the conclusions trustworthy.
Concrete Example Walkthrough:
- Text: “The water in the bath was warm.”
- American G2P (illustrative): water [w a ɾ ə ɹ], bath [b æ θ], warm [w ɔ ɹ m].
- Apply British rules:
- Unflap t: [ɾ] → [t]
- Non-rhotic: drop or vocalize post-vocalic r
- Vowel shifts: TRAP/BATH/GOAT sets
- British-like phonemes (illustrative): water [w ɒ t ə], bath [b ɑː θ], warm [w ɔː m].
- Synthesize with British embedding; evaluate with classifier/similarity; compute PSR with recognized phonemes; confirm naturalness with UTMOS.
04 Experiments & Results
The Test: The team generated about 33k utterances (≈55.4 hours) using Kokoro TTS with fixed durations and two key levers: speaker embeddings (American vs. British presets) and rule sets (none, single rule, or all three). They measured:
- Accent probability (Vox-Profile): How strongly the speech is judged North American vs. British Isles.
- Accent embedding similarity: How close the audio’s accent embedding is to real British or American reference clusters.
- PSR: How much rule-based phoneme changes survive.
- UTMOS: Whether naturalness stays high.
The Competition: Baselines were the same TTS pipeline with embeddings only (no rules). Then they added individual rules (flapping-only, rhoticity-only, vowel-only) and the full stack (all rules). They also did ablations (all rules minus one) to see which rules matter most.
The Scoreboard (made meaningful):
- Naturalness: UTMOS stayed steady—about 4.4 for North American setups and about 3.7 for British ones—whether rules were applied or not. That’s like changing a player’s jersey without slowing them down.
- North American embedding + All British rules: NA accent probability dropped from 86.5% to 58.8%, while British probability rose to 17.3%. Accent similarity also moved toward British (from -0.05 to 0.21). Translation: rules clearly nudged the voice away from American and toward British.
- British embedding + All British rules: British probability rose from 67.8% to 78.4%. Similarity jumped from 0.67 to 0.85. Translation: already-British voices sound more convincingly British with rules.
- PSR: With British embeddings, PSR fell from 0.775 (no rules) to 0.628 (all rules), meaning more rule changes stuck. Translation: the rules landed more often and the embedding resisted less.
Surprising Findings and Nuances:
- Vowels carry huge weight: Applying vowel correspondences gave the biggest boost in British accent probability and a strong PSR improvement. That’s like changing the melody, not just the rhythm—you really hear it.
- Rhoticity helps similarity a lot: Even when the classifier’s probability didn’t swing wildly, similarity to British accents rose. Dropping r’s clusters the sound with true British speech.
- Flapping alone is subtle: It had a smaller solo effect but contributed when combined with other rules—like adding salt that completes the dish.
- Embedding “pull-back” is real: Table 2 shows many planned changes (N1) didn’t fully stick (N2 > 0), especially for vowels. This confirms embeddings sometimes override rules, which PSR cleanly captures.
- Voice-specific behavior: Some voices (e.g., Daniel) already sound very British and still gain from rules; others (e.g., Fable) rely more on rule guidance. Across voices, applying rules consistently lowers PSR by 15–17%, showing the intended sound changes stick more reliably.
Why these numbers matter: Instead of just saying “it sounds British,” the team tied sound changes to measurable shifts in accent probability, accent cluster closeness, and a phoneme-level stickiness score (PSR). Together, these show that linguistics-aware edits improve accentedness reliably, stay natural, and reveal how identity and accent are intertwined in embeddings.
05 Discussion & Limitations
Limitations:
- Automated Metrics: Accent probability, similarity, and PSR depend on particular pretrained models (Vox-Profile, Wav2Vec2Phoneme). If these tools are biased or noisy, measurements may wobble.
- Rule Coarseness: The rules target big, well-known accent cues (flapping, rhoticity, major vowel sets). They don’t capture fine dialect details or prosody (intonation, rhythm).
- Accent Coverage: Experiments focus on American↔British mappings. Other accents likely need tailored rule sets and may interact differently with embeddings.
- Fixed Durations: Holding durations constant isolates segmental effects, but real accents also differ in timing and melody; those aspects weren’t explored here.
- Model Specificity: Results are shown with Kokoro-82M and its voice presets; behavior may vary across TTS architectures.
Required Resources:
- A TTS engine that accepts phoneme inputs and speaker embeddings (e.g., Kokoro).
- A G2P tool to produce phoneme sequences (e.g., Misaki G2P).
- Rule implementation code (provided) to transform phonemes.
- Evaluation models for accent probability/similarity and phoneme recognition to compute PSR; plus UTMOS for naturalness.
When NOT to Use:
- If you need fine-grained regional micro-accents (e.g., distinguishing two neighboring British towns), these coarse rules won’t be enough.
- If your main need is emotion/style transfer, not accent, embeddings or style tokens may be more appropriate.
- For languages where prosody and tone carry much of the accent feel, segment-only rules may underperform.
- In code-switching or mixed-language sentences, simple one-accent rules may produce odd results.
Open Questions:
- Can we learn rules automatically and still keep them interpretable? Hybrid systems might discover more accent features while staying steerable.
- How do prosody and timing rules (beyond phonemes) interact with embeddings? Could a “Prosody Shift Rate” be defined?
- Can embeddings be redesigned to disentangle accent from timbre and emotion better, so PSR improves automatically?
- How well does PSR correlate with human judgments across diverse accents and longer passages?
- What’s the best mix of rules for other accent pairs (e.g., Indian English, Australian English) and for non-English languages?
06 Conclusion & Future Work
Three-Sentence Summary: This paper shows that adding a few clear, linguistics-based pronunciation rules to TTS lets us steer accents more precisely while keeping speech natural. It introduces phoneme shift rate (PSR) to measure how much speaker embeddings accept or resist those rules. Together, rules and PSR reveal—and help manage—the entanglement between accent and speaker identity.
Main Achievement: Turning accent control from a vague, black-box nudge into a set of simple, targeted levers backed by a new, phoneme-level metric that quantifies rule–embedding interaction.
Future Directions: Expand rule sets to more accents and languages; bring in prosody/timing rules; redesign embeddings for better disentanglement; and validate with diverse recognition models and human listener studies. Explore learning rule candidates automatically while keeping human-readable controls.
Why Remember This: It’s a clean recipe for controllable accents—keep timing fixed, apply big, well-known rules, and measure what sticks. It shows that a little linguistic knowledge goes a long way in modern TTS, making voices both understandable and tunable. And PSR gives us a powerful lens to study and improve how identity and accent mix inside speech generation models.
Practical Applications
- Customize voice assistants to speak with region-appropriate accents while keeping the same voice identity.
- Create bilingual or multi-accent audiobooks by toggling targeted rules for characters or chapters.
- Help language learners practice accent features (e.g., rhoticity or TRAP/BATH vowels) with controllable exercises.
- Localize videos and games so dubbed voices match the setting’s accent without re-recording.
- Tune call-center TTS to match a caller’s accent preference for better comprehension and comfort.
- Diagnose and reduce accent–identity entanglement in TTS models using PSR.
- Build A/B tests for accent features and choose rule sets that maximize listener satisfaction without hurting naturalness.
- Prototype new accents for creative projects by composing small rule sets and measuring PSR.
- Develop accessibility tools that adjust accent strength for clearer understanding by diverse listeners.
- Benchmark different TTS systems on accent control using the same rules and PSR for fair comparison.