End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions
Key Summary
- This paper builds one smart system that listens to child-adult conversations and writes what was said, who said it, and exactly when each person spoke.
- Instead of using separate tools that can pass mistakes to each other, the model does everything end-to-end inside one Whisper-based network.
- It learns to output special tokens for start time, who is speaking (child or adult), the words, and the end time, like a tidy recipe it follows every time.
- A tiny extra "diarization head" helps the model tell child vs. adult vs. silence at each moment, making speaker labels and timing more accurate.
- A rule-based "state machine" keeps the output well-formed so it never forgets a timestamp or a speaker tag.
- Silence suppression tells the model to avoid placing timestamps in quiet parts, which sharpens boundaries.
- Across two datasets (Playlogue and ADOS), the new system lowers multi-talker WER compared to strong baselines, meaning fewer mistakes overall and better speaker labeling.
- It often matches or beats separate diarization systems in timing accuracy and clearly outperforms ASR-first pipelines on diarization metrics.
- Child speech remains harder than adult speech, but the joint model still improves both.
- The system enables large-scale, reliable measures like words per minute and speaking rate that support research and clinical insights.
Why This Research Matters
This work makes it much easier and more reliable to study real child-adult conversations at scale. By producing words, speaker roles, and timestamps together in one pass, it avoids the fragile handoffs that often break in separate pipelines. The cleaner, structured transcripts enable accurate measures like words per minute, turn-taking, and response latency that support language development and clinical research. Hospitals and researchers can process more data faster, with fewer manual corrections. Families benefit indirectly from better tools that help track progress and tailor interventions. Over time, this kind of technology can improve our understanding of how children communicate, and how to support them best.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're watching a friendly ping-pong match. It's easy to follow because you can see who hits the ball and when. Now imagine only hearing the sounds (thwack, pause, thwack) without seeing the players. To understand the game, you need to know both what happened and who did it, plus the timing. That's exactly what researchers want from recordings of child-adult conversations.
🥬 The Concept (Automatic Speech Recognition, ASR):
- What it is: ASR is how computers turn speech into text.
- How it works (recipe): 1) Listen to the audio. 2) Break it into sound features. 3) Match sound patterns to likely letters/words. 4) Build a sentence that makes sense. 5) Output the transcript.
- Why it matters: Without ASR, we can't get written transcripts to analyze language development. 🍞 Anchor: When a child says "I want the blue block," ASR writes those words down correctly so we can count words, measure fluency, and understand language use.
🍞 Hook: You know how in a comic book, each speech bubble shows who is talking? Without the bubble tail, you'd be confused. Recordings are like that: you need to know who said each line.
🥬 The Concept (Speaker Diarization):
- What it is: Diarization figures out "who spoke when."
- How it works: 1) Slice audio into little time steps. 2) For each step, decide child, adult, or silence. 3) Merge neighboring steps by the same speaker into turns (a small merge sketch follows below). 4) Attach labels to each turn.
- Why it matters: Without diarization, a transcript is like a pile of quotes without names, useless for studying turn-taking or who said what. 🍞 Anchor: "Hi!" (adult), "Hi." (child), "How are you?" (adult): now we can measure response times and patterns.
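A minimal Python sketch of the frame-merging step above (step 3), assuming per-frame labels arrive as strings at a fixed hop size; the 0.02 s hop and the function name frames_to_turns are illustrative, not taken from the paper:

```python
from typing import List, Tuple

def frames_to_turns(labels: List[str], hop: float = 0.02) -> List[Tuple[str, float, float]]:
    """Merge per-frame labels ('child' / 'adult' / 'silence') into (speaker, start, end) turns."""
    turns: List[Tuple[str, float, float]] = []
    for i, label in enumerate(labels):
        start, end = i * hop, (i + 1) * hop
        if turns and turns[-1][0] == label:
            turns[-1] = (label, turns[-1][1], end)   # same speaker: extend the open turn
        else:
            turns.append((label, start, end))        # speaker changed: open a new turn
    return [t for t in turns if t[0] != "silence"]   # keep only speech turns

# Three adult frames, two silent frames, two child frames:
print(frames_to_turns(["adult", "adult", "adult", "silence", "silence", "child", "child"]))
# -> roughly [('adult', 0.0, 0.06), ('child', 0.10, 0.14)], with the silence dropped
```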
🍞 Hook: Think of a school relay race. If one runner trips, the next runner gets the baton late, and the whole team slows down. That's what happens when we use separate systems for diarization and ASR: mistakes pass along.
🥬 The Concept (Cascaded Pipelines and Error Propagation):
- What it is: Cascaded pipelines do diarization first and ASR second (or the reverse); errors from the first step mess up the next.
- How it works: 1) Step A guesses boundaries or words. 2) Step B uses those guesses as if they're facts. 3) Any mistake in A becomes bigger in B.
- Why it matters: You can end up with wrong speakers, bad timestamps, or both. 🍞 Anchor: If a child's "um" gets labeled as adult speech, the ASR might split the audio wrong and later mis-assign the whole sentence.
🍞 Hook: Imagine one super organizer that plans the whole party (food, music, and games) to make sure everything fits together.
🥬 The Concept (End-to-End Modeling):
- What it is: An end-to-end model learns everything together in one system instead of in separate stages.
- How it works: 1) Feed in audio. 2) A shared brain (the encoder) learns useful sound patterns. 3) One decoder writes words, speaker tags, and timestamps. 4) All parts learn together to reduce conflicts.
- Why it matters: It avoids handoffs that cause errors and makes timing and speaker labels agree with the words. 🍞 Anchor: The model outputs a neat line like: <start_time> <adult> Hello. <end_time> <start_time> <child> Hi! <end_time>, all in order.
🍞 Hook: Picture a giant library where a librarian (the model) is trained on tons of languages and accents. Whisper is like that librarian for sound.
🥬 The Concept (Whisper Model):
- What it is: Whisper is a pretrained encoder-decoder model that's good at ASR and general speech tasks.
- How it works: 1) Encoder turns sound into rich features. 2) Decoder turns features into tokens (text and special markers). 3) Pretraining on huge data helps it handle noisy, varied speech.
- Why it matters: Child speech is tricky; Whisper's broad training gives a strong starting point before fine-tuning on child-adult data. 🍞 Anchor: Fine-tuned Whisper hears a preschooler's "I fink it's funny" and still types the intended words with the right speaker tag (a short zero-shot sketch follows below).
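As a concrete starting point, here is a minimal zero-shot sketch using the open-source openai-whisper package; the paper fine-tunes Whisper and extends its output format rather than using it off the shelf, and the audio file name here is hypothetical:

```python
import whisper  # pip install openai-whisper

# "small" and "large" mirror the two model sizes discussed in this paper.
model = whisper.load_model("small")

# Zero-shot transcription of a (hypothetical) child-adult recording. The joint
# system instead fine-tunes this model and adds speaker-role and timestamp tokens.
result = model.transcribe("clinic_session.wav", language="en")
print(result["text"])
```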
The World Before: Researchers needed transcripts that say who said what and when to measure skills like words per minute, turn-taking, and response times. People used cascaded pipelines, either diarize-then-transcribe or transcribe-then-align, which pass along mistakes (error propagation). Child speech is different from adult speech (pitch, pronunciation, vocabulary), and recordings can be noisy. So timestamps and speaker labels often went wrong.
The Problem: We need accurate words, correct speaker roles (child vs. adult), and reliable start/end times, all together. Doing this with separate modules is brittle and hard to tune.
Failed Attempts: 1) Diarization-first did okay on who-talked-when but cut audio oddly, making ASR struggle. 2) ASR-first wrote words well but mislabeled speakers and mis-timed words during forced alignment. 3) Heuristic fixes helped a bit but weren't robust.
The Gap: There wasn't a single model that outputs text, speaker roles, and timestamps together in a stable, structured way for child-adult speech.
Real Stakes: These outputs let clinicians and researchers track language growth, social communication, and conversational rhythm. If we get them wrong, we might misjudge a child's abilities. If we get them right, we can scale support to many families and studies.
02 Core Idea
🍞 Hook: Think of a movie script where every line starts with the character's name and the time the line starts and ends. If the script writes itself perfectly, filming is easy.
🥬 The Concept (Key Insight):
- What it is: Teach one Whisper-based model to write a single, well-ordered stream of tokens that includes start time, speaker role (child/adult), the words, and end time, repeated for each utterance.
- How it works: 1) Use Serialized Output Training (SOT) so the decoder always follows the same pattern. 2) Add a diarization head to teach the encoder who is speaking at each frame. 3) Guide decoding with silence suppression so timestamps don't land in quiet parts. 4) Enforce a state machine so the token order stays valid.
- Why it matters: This prevents the messy handoffs of pipelines, keeps structure correct, and improves both transcription and diarization. 🍞 Anchor: The model emits: <|t_start|> <adult> How are you? <|t_end|> <|t_start|> <child> I'm good! <|t_end|>, clean and complete.
Three Analogies:
- Orchestra Conductor: Instead of separate musicians playing off-beat, one conductor keeps words, speakers, and timing in sync.
- Lego Set with Instructions: Each block (token) snaps in a set order: time-start, who, words, time-end. No missing pieces.
- GPS with Road Rules: The model knows not just where to go (what to write) but also the allowed turns (state machine). It can't take an illegal shortcut.
Before vs After:
- Before: Cascaded systems guess boundaries then words (or vice versa). Mistakes snowball; timestamps drift; speakers get mislabeled.
- After: One network learns words, roles, and times together. The encoder becomes both time-aware and speaker-aware; the decoder follows a strict recipe.
🍞 Hook: You know how good studying means linking facts together so they reinforce each other? That's why this works.
🥬 The Concept (Why It Works, Intuition):
- What it is: Joint learning gives the encoder a shared purpose: be great at timing (when speech happens) and identity (who speaks) while supporting word prediction.
- How it works: 1) The diarization head pushes features to separate child/adult/silence clearly. 2) Timestamp tokens push features to align with real boundaries. 3) The decoder turns those features into perfectly ordered outputs. 4) Silence suppression and the state machine prevent silly mistakes at inference.
- Why it matters: The parts help each other: better timing → cleaner words; better speaker cues → fewer role mix-ups; strict decoding → valid structure every time. 🍞 Anchor: On clean clinic audio, start/end times lock onto real pauses, speaker tags are consistent, and the transcript reads like a polished script.
Building Blocks (with sandwich explanations):
🍞 Hook: Imagine a to-do checklist you always follow. 🥬 The Concept (Serialized Output Training, SOT):
- What it is: A training method where the output is a single sequence that alternates between special markers and words.
- How it works: 1) Start: <|t_start|>. 2) Say who: <child>/<adult>. 3) Write words. 4) End: <|t_end|>. Repeat.
- Why it matters: The decoder learns a habit, so there are no missing labels or times. 🍞 Anchor: "<0.7> <child> Hi. <0.9>" then "<1.2> <adult> How are you? <2.0>" (a small serialization sketch follows below).
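A minimal sketch of how reference utterances could be serialized into this kind of target string. The exact token spellings (<|0.70|>, <child>, <adult>) and the 0.02 s quantization are illustrative assumptions, not the paper's actual vocabulary:

```python
from typing import List, Tuple

def serialize(utterances: List[Tuple[str, float, float, str]], resolution: float = 0.02) -> str:
    """Turn (speaker, start, end, text) tuples into one SOT-style target string."""
    pieces = []
    for speaker, start, end, text in sorted(utterances, key=lambda u: u[1]):
        # Quantize timestamps to the assumed token resolution.
        t0 = round(start / resolution) * resolution
        t1 = round(end / resolution) * resolution
        pieces.append(f"<|{t0:.2f}|> <{speaker}> {text} <|{t1:.2f}|>")
    return " ".join(pieces)

print(serialize([("adult", 1.18, 2.0, "How are you?"), ("child", 0.7, 0.9, "Hi.")]))
# <|0.70|> <child> Hi. <|0.90|> <|1.18|> <adult> How are you? <|2.00|>
```

In training, a string like this would be tokenized and used as the decoder target.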
🍞 Hook: Think of a coach whispering, "That's the kid speaking now." 🥬 The Concept (Diarization Head):
- What it is: A small add-on that predicts child/adult/silence for each time frame.
- How it works: 1) Look at encoder features. 2) Output probabilities per frame. 3) Train with the rest so speaker cues get stronger.
- Why it matters: Sharpens the encoder's sense of who is talking and when. 🍞 Anchor: During a child's answer, the head fires "child, child, child," then flips to "silence" at the pause.
🍞 Hook: You don't paint edges during quiet time. 🥬 The Concept (Silence Suppression):
- What it is: During decoding, avoid placing timestamps in predicted silences.
- How it works: 1) Find silences from the diarization head. 2) Shrink edges by 0.2 s for safety. 3) Downweight timestamp tokens inside these spans.
- Why it matters: Fewer boundary mistakes and cleaner segments. 🍞 Anchor: The model waits to place <|t_end|> right after speech ends, not in the middle of a quiet gap.
🍞 Hook: Baking rules: mix → pour → bake → cool, no skipping. 🥬 The Concept (State-Machine-Based Forced Decoding):
- What it is: A set of rules that only allows valid next tokens.
- How it works: 1) At start, only start tokens are legal. 2) After <|t_start|>, only speaker tags or text (as allowed). 3) After text, allow <|t_end|>. 4) Repeat or finish.
- Why it matters: Prevents malformed outputs like missing timestamps. 🍞 Anchor: The model cannot output words before choosing child or adult, so structure stays correct.
Put together, these ideas convert long, messy audio into a neat, labeled, and timed transcript that's ready for research and clinical use.
03 Methodology
At a high level: Audio → Whisper Encoder → (A) Diarization Head (frame labels) + (B) Whisper Decoder with SOT → Structured transcript with speaker tags and start/end timestamps.
Step-by-step (with sandwiches where new concepts appear):
- Input and Encoding
- What happens: The audio is turned into log-Mel features and fed into Whisper's encoder, producing time-aligned embeddings that capture sounds, phonemes, and prosody (a short sketch follows below).
- Why this step exists: The encoder creates a shared "sound map" both the decoder and diarization head can use. Without it, the rest can't reason about timing or identity.
- Example: A 23-second clip becomes a sequence of embeddings, one every small time step.
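A minimal sketch of this encoding step with the openai-whisper package; the file name is hypothetical, and in the actual system a fine-tuned checkpoint would stand in for the stock one:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("small")

# Load the (hypothetical) recording, pad/trim to Whisper's fixed 30 s window,
# and compute log-Mel features: one spectral column per ~10 ms of audio.
audio = whisper.pad_or_trim(whisper.load_audio("clinic_session.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)   # (n_mels, n_frames)

# The encoder turns the spectrogram into time-aligned embeddings that both the
# decoder and the diarization head can read.
audio_features = model.embed_audio(mel.unsqueeze(0))        # (1, frames, d_model)
print(audio_features.shape)                                 # e.g. torch.Size([1, 1500, 768])
```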
- Serialized Output Training (SOT) for the Decoder 🍞 Hook: You know how you stack pancakes in the same order (pancake, syrup, pancake, syrup) so the tower stands straight? 🥬 The Concept:
- What it is: The decoder is trained to emit tokens in a repeating template: start-time → speaker → words → end-time.
- How it works: 1) Teacher-forcing during training shows the correct sequence. 2) Cross-entropy loss learns to predict each next token. 3) The model sees many examples so the pattern becomes second nature.
- Why it matters: Without SOT, the decoder may forget timestamps or speaker markers. 🍞 Anchor: For "Adult: Hello. Child: Hi." the model learns to write "<|t_start|> <adult> Hello. <|t_end|> <|t_start|> <child> Hi. <|t_end|>" (a minimal training-step sketch follows below).
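A minimal training-step sketch of this idea using Hugging Face Transformers, with dummy audio and an illustrative target string; the added speaker tokens, the target format, and the training recipe here are assumptions, not the paper's implementation:

```python
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Extend the vocabulary with illustrative speaker-role tokens (the paper's exact
# token inventory may differ).
processor.tokenizer.add_special_tokens({"additional_special_tokens": ["<child>", "<adult>"]})
model.resize_token_embeddings(len(processor.tokenizer))

# Dummy 5 s of 16 kHz audio standing in for a real child-adult clip.
audio = np.random.randn(16000 * 5).astype(np.float32)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Serialized target: start time, speaker, words, end time, repeated per utterance.
target = "<|0.10|> <adult> Hi. <|0.50|> <|0.70|> <child> Hi! <|0.90|>"
labels = processor.tokenizer(target, return_tensors="pt").input_ids

# Teacher forcing + cross-entropy happen inside the forward pass when labels are given.
loss = model(input_features=input_features, labels=labels).loss
loss.backward()  # one gradient step of the ASR part of the joint objective
print(float(loss))
```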
- Diarization Head on the Encoder 🍞 Hook: Like putting name tags on speakers in real time. 🥬 The Concept:
- What it is: A small stack of 1D CNNs attached to the final encoder layer that predicts child/adult/silence for each frame.
- How it works: 1) Take encoder features per frame. 2) Classify into three classes. 3) Train with cross-entropy (multi-task) alongside the decoder.
- Why it matters: It sharpens who/when cues, helping timestamps and speaker tokens. 🍞 Anchor: During an adult's question, frames say "adult," then flip to "silence," then to "child" during the reply (a small module sketch follows below).
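A minimal PyTorch sketch of such a head; the layer sizes and kernel widths are illustrative assumptions, and only the overall shape (1D convolutions over encoder frames, three output classes) follows the description above:

```python
import torch
import torch.nn as nn

class DiarizationHead(nn.Module):
    """Frame-level child / adult / silence classifier on top of encoder features."""

    def __init__(self, d_model: int = 768, hidden: int = 256, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel_size=3, padding=1),  # local temporal context
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, n_classes, kernel_size=1),           # per-frame class logits
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, d_model) -> logits: (batch, frames, n_classes)
        return self.net(encoder_states.transpose(1, 2)).transpose(1, 2)

head = DiarizationHead()
dummy_states = torch.randn(1, 1500, 768)   # e.g. Whisper-small encoder output
print(head(dummy_states).shape)            # torch.Size([1, 1500, 3])
```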
- Multi-Task Loss (ASR + Diarization)
- What happens: The total loss is L_total = L_ASR + λ * L_diar. The model jointly learns word tokens, timestamp tokens, and frame-level speaker labels.
- Why this step exists: Joint learning ties timing, identity, and words together inside the encoder, reducing conflicts.
- Example: If the child's speech is mis-timed, the diarization loss pushes frames to align better, which later helps the decoder place timestamps (a short sketch of the combined loss follows below).
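In code, the combination is just a weighted sum of two cross-entropy terms; a minimal sketch, where the logits and labels are stand-ins and the weight λ = 0.5 is a hypothetical value:

```python
import torch
import torch.nn.functional as F

def joint_loss(asr_logits, asr_targets, diar_logits, diar_targets, lam: float = 0.5):
    """L_total = L_ASR + lambda * L_diar, both as token/frame-level cross-entropy."""
    l_asr = F.cross_entropy(asr_logits.reshape(-1, asr_logits.size(-1)), asr_targets.reshape(-1))
    l_diar = F.cross_entropy(diar_logits.reshape(-1, diar_logits.size(-1)), diar_targets.reshape(-1))
    return l_asr + lam * l_diar

# Stand-in tensors: 20 decoder tokens over a 100-token vocab, 1500 frames over 3 classes.
asr_logits, asr_targets = torch.randn(1, 20, 100), torch.randint(0, 100, (1, 20))
diar_logits, diar_targets = torch.randn(1, 1500, 3), torch.randint(0, 3, (1, 1500))
print(joint_loss(asr_logits, asr_targets, diar_logits, diar_targets))
```

In the joint model, the ASR term comes from the decoder's token predictions and the diarization term from the head's frame predictions; λ balances the two.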
- Diarization-Guided Silence Suppression at Inference 🍞 Hook: Editors cut dead air from a video to make scenes flow. 🥬 The Concept:
- What it is: During decoding, the system avoids placing timestamps inside predicted silences.
- How it works: 1) Find spans where silence probability is high. 2) Shrink each by 0.2 s at both ends to allow natural edges. 3) Downweight timestamp tokens in these spans during beam search.
- Why it matters: Prevents drifting timestamps and over-segmentation. 🍞 Anchor: The end timestamp for "Hi." lands right after the "i" sound finishes, not in the middle of a quiet pause (a small logit-masking sketch follows below).
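A minimal sketch of the logit adjustment, assuming timestamp tokens are laid out on a regular time grid in the vocabulary; the penalty value, grid resolution, vocabulary layout, and 0.2 s margin are illustrative assumptions:

```python
import torch

def suppress_silence(logits: torch.Tensor, silence_spans, timestamp_offset: int,
                     resolution: float = 0.02, margin: float = 0.2,
                     penalty: float = -1e4) -> torch.Tensor:
    """Downweight timestamp tokens that fall inside predicted silence spans.

    logits: (vocab_size,) next-token scores; the timestamp token for time t is
    assumed to live at index timestamp_offset + round(t / resolution).
    """
    logits = logits.clone()
    for start, end in silence_spans:
        s, e = start + margin, end - margin        # shrink each span by the margin
        if e <= s:
            continue                               # span too short after shrinking
        lo = timestamp_offset + int(round(s / resolution))
        hi = timestamp_offset + int(round(e / resolution))
        logits[lo:hi + 1] += penalty               # make these timestamps very unlikely
    return logits

# Toy example: 2000 "text" tokens followed by timestamp tokens covering 0-30 s.
scores = torch.zeros(2000 + 1501)
masked = suppress_silence(scores, silence_spans=[(1.0, 2.0)], timestamp_offset=2000)
print(int((masked < -100).sum()), "timestamp tokens suppressed")
```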
- State-Machine-Based Forced Decoding 🍞 Hook: Board games have legal moves; you can't move a rook like a knight. 🥬 The Concept:
- What it is: A finite-state machine allows only legal next tokens in each stage of decoding.
- How it works: 1) Start state allows start tokens. 2) Next state allows speaker tags. 3) Then text tokens until an end-time is allowed. 4) Repeat or end. Invalid tokens are masked to zero probability.
- Why it matters: Stops malformed outputs like missing speaker tags or timestamps. 🍞 Anchor: The model cannot say "Hello" before choosing <child> or <adult>, so every utterance is properly labeled and timed (a small state-machine sketch follows below).
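A minimal sketch of such a token-level state machine, written over token classes rather than real vocabulary IDs; the exact states and allowed transitions in the paper may differ slightly:

```python
from enum import Enum, auto

class State(Enum):
    EXPECT_START = auto()    # only a start-time token (or end of transcript) is legal
    EXPECT_SPEAKER = auto()  # only <child> / <adult> are legal
    IN_TEXT = auto()         # text tokens, or an end-time token to close the utterance

ALLOWED = {
    State.EXPECT_START:   {"t_start", "eos"},  # a new utterance, or end of transcript
    State.EXPECT_SPEAKER: {"speaker"},
    State.IN_TEXT:        {"text", "t_end"},
}

NEXT_STATE = {
    (State.EXPECT_START, "t_start"): State.EXPECT_SPEAKER,
    (State.EXPECT_SPEAKER, "speaker"): State.IN_TEXT,
    (State.IN_TEXT, "text"): State.IN_TEXT,
    (State.IN_TEXT, "t_end"): State.EXPECT_START,   # ready for the next utterance
}

def step(state: State, token_class: str) -> State:
    """Advance the state machine; illegal token classes raise (in decoding they are masked)."""
    if token_class not in ALLOWED[state]:
        raise ValueError(f"{token_class} is not legal in state {state}")
    return NEXT_STATE.get((state, token_class), state)

# Walk through one well-formed utterance: <t_start> <speaker> word word <t_end>
state = State.EXPECT_START
for tok in ["t_start", "speaker", "text", "text", "t_end"]:
    state = step(state, tok)
print(state)  # back to State.EXPECT_START, ready for the next utterance or eos
```

During beam search, the same ALLOWED map would be used to mask the logits of every token class that is not legal in the current state.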
- Pretraining and Fine-Tuning the Diarization Head
- What happens: Before joint training, the diarization head is pretrained (encoder frozen) using a scalar mix of encoder layers to learn good speaker/silence cues. Then, in joint training, the full model is unfrozen (final-layer features feed the head). Finally, the head is fine-tuned alone to polish silence detection.
- Why this step exists: Randomly starting the head can be slow and noisy on limited data; pretraining stabilizes learning and improves silence and role accuracy.
- Example: After pretraining, the head already knows that high pitch + certain formants mean "child" more often, which helps the joint stage (a small scalar-mix sketch follows below).
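A minimal sketch of a scalar mix for the pretraining stage: a softmax-weighted average over the encoder's layer outputs with learnable weights. The layer count matches Whisper-small (embedding output plus 12 blocks); everything else is an illustrative assumption:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learn a softmax-weighted combination of per-layer encoder outputs."""

    def __init__(self, n_layers: int = 13):         # Whisper-small: embedding + 12 blocks
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):                # list of (batch, frames, d_model)
        stacked = torch.stack(layer_outputs, dim=0)  # (n_layers, batch, frames, d_model)
        norm_w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return self.gamma * (norm_w * stacked).sum(dim=0)

# During head pretraining the encoder is frozen and the mix feeds the diarization head;
# in joint training only the final encoder layer is used instead.
mix = ScalarMix()
dummy_layers = [torch.randn(1, 1500, 768) for _ in range(13)]
print(mix(dummy_layers).shape)                       # torch.Size([1, 1500, 768])
```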
- End-to-End Example with Realistic Tokens
- Audio snippet (2.0 s): Adult: "Hi." (0.1-0.5 s), Child: "Hi!" (0.7-0.9 s)
- Model output: "<0.1> <adult> Hi. <0.5> <0.7> <child> Hi! <0.9>"
- What breaks without each piece:
- Without SOT: Might skip <0.1> or <0.5> and jumble order.
- Without diarization head: Might mislabel speaker or blur silence.
- Without silence suppression: Might place <0.5> at 0.6s (drift).
- Without state machine: Might output words before choosing a speaker tag.
Secret Sauce:
- SOT gives the decoder a tight script.
- The diarization head sculpts encoder features to be both time-aware and role-aware.
- Silence suppression keeps boundaries crisp.
- The state machine guarantees valid structure. Together, they cut error propagation and produce reliable, ready-to-use, speaker-attributed transcripts.
04 Experiments & Results
🍞 Hook: Think of a school tournament where one team has to listen to conversations and keep perfect score: words, who said them, and when. We test different teams and see who wins.
🥬 The Concept (The Test):
- What it is: Measure how well the system transcribes words, attributes them to child vs. adult, and marks time boundaries.
- How it works: 1) Datasets: Playlogue (natural play) and ADOS (structured clinical sessions). 2) Models: Whisper-small and Whisper-large. 3) Metrics:
- WER (word errors), AER (speaker attribution errors), DER (diarization errors), and mtWER (multi-talker WER that combines them).
- Why it matters: We want low errors across words, speaker roles, and times, not just one of them. 🍞 Anchor: If mtWER drops, it's like getting a better overall grade that counts both correct answers and neat, labeled work. (A small metric-computation sketch follows below.)
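To get a feel for these metrics, here is a minimal sketch with common open-source tools (jiwer for WER, pyannote.metrics for DER). The speaker-attributed variants used in the paper (AER, mtWER) need scoring beyond this sketch, and the reference/hypothesis examples here are made up:

```python
from jiwer import wer
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Word error rate on (made-up) reference vs. hypothesis transcripts.
reference_text = "hi how are you i am good"
hypothesis_text = "hi how are you im good"
print("WER:", wer(reference_text, hypothesis_text))

# Diarization error rate on (made-up) reference vs. hypothesis turns.
reference = Annotation()
reference[Segment(0.1, 0.5)] = "adult"
reference[Segment(0.7, 0.9)] = "child"

hypothesis = Annotation()
hypothesis[Segment(0.1, 0.55)] = "adult"
hypothesis[Segment(0.72, 0.9)] = "child"

print("DER:", DiarizationErrorRate()(reference, hypothesis))
```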
🍞 Hook: You know how you race your invention against last year's models to see if it's truly better?
🥬 The Concept (The Competition):
- What it is: Compare against three strong baselines.
- How it works:
- Zero-shot WhisperX pipeline (no fine-tuning).
- Diarization-first: predict segments + roles, then ASR per segment.
- ASR-first: SOT-ASR outputs words+roles, then forced alignment adds times.
- Why it matters: If the new joint model beats these, it means the integration helps in realistic ways. 🍞 Anchor: On Playlogue and ADOS, the joint model is consistently top or near-top across metrics.
Scoreboard with Context (selected highlights):
- Playlogue:
- Whisper-small: Proposed mtWER 37.4% vs. best cascaded baseline 41.4% (ASR-first). That's like going from a B- to a solid B+.
- Whisper-large: Proposed 34.3% vs. best baseline 38.8% (diarization-first). Even better for the bigger model.
- ADOS:
- Whisper-small: Proposed 28.8% vs. best baseline 33.6% (ASR-first). Clear win.
- Whisper-large: Proposed 21.7% vs. best baseline 27.0% (ASR-first). Big gain; like lifting your grade by a full letter.
More context:
- WER: The joint model often beats ASR-first pipelines, showing that better timing/role cues also help word accuracy.
- AER: Explicit diarization supervision lowers role mix-ups compared to ASR-first.
- DER: Joint beats ASR-first (forced alignment is brittle). Against diarization-first, DER is mixed: the joint model is close or better on cleaner ADOS (especially Whisper-large), but slightly behind on noisier Playlogue.
🍞 Hook: You know how children's voices can be squeakier and less steady? That makes them tougher for machines too.
🥬 The Concept (Child vs. Adult Difficulty):
- What it is: Child speech is harder due to pitch, articulation, and variability.
- How it works: 1) Compare mtWER by role. 2) Examine subgroups by age and autism severity (ADOS CSS).
- Why it matters: Knowing where it struggles helps target improvements. 🍞 Anchor: On ADOS, proposed child mtWER can still be higher than adult, but both improve over baselines; kids with higher severity (CSS) show higher errors, regardless of age.
Surprising/Notable Findings:
- State-machine forced decoding eliminated missing-timestamp/speaker-token failures (which were huge: up to 46% in some settings). That alone turns many unusable outputs into clean transcripts.
- Pretraining the diarization head matters; random init can lag and even hurt DER on limited data.
- Silence suppression consistently reduces timestamp drift and helps DER.
Bottom line: The joint model reliably lowers mtWER across datasets and model sizes, improves AER strongly over ASR-first, and delivers competitive DER, especially in cleaner clinical audio, while producing well-formed, ready-to-use transcripts.
05 Discussion & Limitations
Limitations:
- Overlapping Speech: The system is trained/evaluated on non-overlapping segments; heavy overlaps could degrade accuracy since the output format assumes turn-by-turn structure.
- Noisy, Wild Recordings: In very noisy home/play environments (like Playlogue), timestamp tokens can still drift slightly, raising DER compared to a specialized diarization-first pipeline.
- Domain Shift: Child speech varies by age, development, accent, and clinical profile. Extreme shifts (e.g., much younger ages or atypical vocalizations) may need extra fine-tuning.
- Token Loops: State-machine decoding nearly removes missing-token errors but can rarely cause decoding loops; safeguards are needed (e.g., max token limits).
Required Resources:
- GPU Memory: Whisper-large training and joint objectives need a strong GPU (e.g., 48 GB A6000 in the paper) for feasible batch sizes.
- Labeled Data: Training needs transcripts with speaker roles and times; while not enormous, careful curation is key.
- Engineering Pipelines: Pretraining the diarization head and enforcing state-machine decoding require light custom code.
When NOT to Use:
- Densely Overlapped Dialogues (e.g., multiple people talking at once extensively): Consider overlap-aware or separation-based models.
- Ultra-Low Latency Streaming Needs: The current setup operates on up to ~30s chunks; extra work is needed for strict real-time constraints.
- Multi-Role Scenarios Beyond Child/Adult: The model is optimized for two roles; new roles require retraining and careful token design.
Open Questions:
- Robustness to Overlap: Can SOT and the diarization head be extended to handle overlapping speech explicitly, perhaps with multi-label tokens or separation heads?
- Better Timestamping: Could learned boundary detectors or alignment-aware losses tighten DER further in noisy conditions?
- Role Generalization: How easily does the approach extend to interviewer-interviewee, teacher-student, or doctor-patient roles?
- Low-Resource Adaptation: Can synthetic augmentation or self-training shrink the data needs for new clinics or languages?
- Clinical Outcomes: Which automatically extracted conversational metrics best correlate with clinician ratings over time?
Overall, the method brings a big quality-of-life upgrade (clean structure, fewer errors, and one-stop training) while leaving room for advances in overlap handling, streaming, and broader role sets.
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper presents a single Whisper-based model that jointly transcribes speech, marks who is speaking (child or adult), and adds accurate start/end timestamps in one pass.
- It uses a tidy output recipe (SOT), a small diarization head for frame-level cues, silence-aware decoding, and a rule-based state machine to prevent malformed outputs.
- Across two datasets, it lowers combined errors (mtWER) versus strong baselines and delivers structured, ready-to-use transcripts that align well with human-derived metrics.
Main Achievement:
- Unifying ASR, speaker-role diarization, and timestamping into one reliable, end-to-end framework that cuts error propagation and guarantees well-formed outputs.
Future Directions:
- Extend to overlapping speech and more than two roles, and refine timestamp accuracy in very noisy scenes.
- Explore streaming-friendly decoding and lightweight adaptation for new clinics, ages, and languages.
- Link automated conversational metrics even more directly to clinical assessments and long-term outcomes.
Why Remember This:
- It shows that one carefully designed model can replace brittle pipelines, delivering cleaner, speaker-attributed transcripts at scale.
- The structured decoding and silence guidance make outputs practical, not just accurate.
- This unlocks large-scale, consistent language and interaction measures that can support research and clinical care for children.
Practical Applications
- Automate transcription and speaker labeling for clinical assessments (e.g., ADOS) to reduce manual workload.
- Compute child language metrics (words per minute, utterance length, speaking rate) directly from model outputs (a small sketch follows this list).
- Monitor conversational turn-taking and response latency over time for developmental tracking.
- Analyze intervention sessions to quantify changes in expressive language before and after therapy.
- Support large-scale research studies by processing thousands of hours of child-adult audio reliably.
- Build tools for teachers and clinicians to visualize who spoke when and how much in classroom or clinic settings.
- Enable privacy-aware, on-device or batch processing pipelines using Whisper-small vs. Whisper-large trade-offs.
- Assist dataset curation by flagging likely mis-timed segments or role mix-ups for quick human review.
- Adapt to other two-role contexts (e.g., interviewer-interviewee) with minimal retraining.
- Power dashboards that integrate transcripts, roles, and timing for rapid insight into conversational patterns.
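A minimal sketch of computing words per minute and child response latency from speaker-attributed, timestamped utterances; the input format mirrors the structured output described above, and all numbers are made up:

```python
from typing import Dict, List, Tuple

Utterance = Tuple[str, float, float, str]  # (speaker, start, end, text)

def words_per_minute(utterances: List[Utterance]) -> Dict[str, float]:
    """Words per minute of each speaker's own talking time."""
    words: Dict[str, int] = {}
    seconds: Dict[str, float] = {}
    for speaker, start, end, text in utterances:
        words[speaker] = words.get(speaker, 0) + len(text.split())
        seconds[speaker] = seconds.get(speaker, 0.0) + (end - start)
    return {spk: 60.0 * words[spk] / seconds[spk] for spk in words if seconds[spk] > 0}

def mean_child_response_latency(utterances: List[Utterance]) -> float:
    """Average gap between the end of an adult turn and the start of the next child turn."""
    gaps = [nxt[1] - cur[2]
            for cur, nxt in zip(utterances, utterances[1:])
            if cur[0] == "adult" and nxt[0] == "child" and nxt[1] >= cur[2]]
    return sum(gaps) / len(gaps) if gaps else float("nan")

session = [("adult", 0.1, 0.5, "Hi there."), ("child", 0.9, 1.3, "Hi!"),
           ("adult", 1.6, 2.4, "How are you today?"), ("child", 2.9, 3.6, "I am good.")]
print(words_per_minute(session))             # per-speaker WPM, e.g. {'adult': ..., 'child': ...}
print(mean_child_response_latency(session))  # e.g. about 0.45 s
```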