PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues
Key Summary
- Most people on Earth speak more than one language and often switch languages in the same chat, but AI tools aren’t tested well on this real behavior.
- PINGPONG is a new benchmark made from real human group chats (2–4 people) where speakers naturally switch between two or even three languages.
- It covers five language mixes, including ones with different writing scripts, and adds three tasks: question answering, dialogue summarization, and topic classification.
- The chats are long, messy in a natural way, and include replies to older messages—things machine-generated data rarely captures.
- Models—both open and proprietary—struggle on PINGPONG, showing big room for improvement, especially on trilingual and mixed-script code-switching.
- Reasoning (showing a step-by-step “thinking trace”) usually helps models do better, more than just giving a few examples in the prompt.
- Regional models tuned to local languages often beat general multilingual models, showing that specialization matters.
- PINGPONG measures how intensely languages are mixed (Code-Mixing Index, CMI) and how often switches happen (Switch Point Fraction, SPF), giving a fuller picture of multilingual complexity.
- This benchmark pushes AI toward handling the lively, tangled reality of real group chats used every day around the world.
Why This Research Matters
Real people don’t speak one language at a time in perfect turns; they mix languages and jump between threads. PINGPONG makes AI face that reality so tools can serve the multilingual majority fairly. Better handling of code-switching means clearer customer support, smarter classroom tools, and safer summaries of community discussions. It helps small and low-resource languages be seen and supported rather than ignored. Policymakers and companies can rely on PINGPONG to measure progress honestly, not just on neat, unrealistic tests. Over time, this pushes the whole AI field to build systems that actually work for how people really talk online.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how in a group chat with friends, people talk over each other, jump back to an old message, and sometimes switch languages mid-sentence? Phones handle this fine, but computers often get confused.
🥬 Concept 1 — Natural Language Processing (NLP)
- What it is: NLP is how we teach computers to understand and use human language.
- How it works (recipe): 1) Read the text, 2) Find patterns in words and sentences, 3) Learn from lots of examples, 4) Predict or generate useful responses.
- Why it matters: Without NLP, computers can’t help us search, translate, summarize, or chat. 🍞 Anchor: When you ask a voice assistant, “What’s the weather?” and it answers, that’s NLP at work.
🥬 Concept 2 — Multi-Turn Dialogue
- What it is: A conversation that goes back and forth many times.
- How it works: 1) Person A says something, 2) Person B replies, 3) People keep responding, sometimes to earlier points, 4) The chat forms a long story.
- Why it matters: Without tracking earlier turns, an AI forgets context and gives off-topic answers. 🍞 Anchor: In a 50-message group chat about planning a trip, the AI must remember who’s bringing snacks from 30 messages ago.
🥬 Concept 3 — Code-Switching
- What it is: Switching between two or more languages in the same conversation.
- How it works: 1) Speakers pick whichever word or sentence feels natural, 2) Switch when it’s faster, clearer, or more cultural, 3) Mix languages at word, sentence, or paragraph level.
- Why it matters: Without handling code-switching, AI misses meaning and makes bad guesses. 🍞 Anchor: “Let’s meet jam 3 ya, at the mall, nanti aku bawa snacks.” That’s Indonesian–English code-switching.
Before this work, AI had improved a lot on single-language conversations and even some bilingual tasks. But there was a big gap: real, messy, multi-person code-switching chats weren’t being tested well. Many older datasets were short, clean, or focused on one-on-one talk. Some weren’t public; most were only bilingual; and they rarely captured natural structures like replying to a much older message or having uneven speaker activity (one person dominates while another lurks).
The problem: LLMs today often stumble on real multilingual group chats. They may misunderstand who replied to whom, miss a switch in language, or produce bland summaries that skip key points.
Failed attempts: Researchers tried synthetic (machine-generated) dialogues to fill the gap. But these chats tended to be too neat—perfect turn-taking, equal-length messages, and few long-distance replies. That’s not how humans actually chat.
The gap PINGPONG fills: A modern, open, human-authored benchmark of multi-party, multi-turn, naturally code-switched dialogues across five language combinations, including trilingual and mixed-script cases, plus three downstream tasks (QA, summarization, topic classification) to test real understanding.
Real stakes: If AI can’t follow code-switched chats, it fails many users worldwide—parents and teachers in bilingual communities, customer support agents chatting with code-mixing customers, and teams collaborating across languages.
🥬 Concept 4 — Benchmark
- What it is: A standard test set to see how good different AI models are at the same task.
- How it works: 1) Build a dataset, 2) Define tasks and metrics, 3) Evaluate models fairly, 4) Compare results.
- Why it matters: Without a good benchmark, we can’t tell if models truly improve. 🍞 Anchor: PINGPONG is like a fair exam that checks if AI can handle real multilingual group chats.
🥬 Concept 5 — Multi-Party Dialogue
- What it is: Conversations with 2–4 speakers (or more), not just one-on-one.
- How it works: 1) Many voices, 2) Overlapping topics, 3) Uneven participation, 4) Replies to older messages.
- Why it matters: Without multi-party modeling, AI mixes up who said what and loses threads. 🍞 Anchor: Four friends plan a trip; one sends three messages in a row, another replies to message #7 while the chat is at #30.
🥬 Concept 6 — Multi-Threaded Replies (Replying to older turns)
- What it is: Responding to a message from many turns ago.
- How it works: 1) Point to the earlier message (e.g., a “reply” feature), 2) Continue that mini-topic, 3) Juggle multiple mini-topics at once.
- Why it matters: Without tracking threads, AI summaries get jumbled and QA answers the wrong thing. 🍞 Anchor: “@Alex re: your #12 snack list—I’ll bring chips,” posted at turn #49.
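To make the thread structure concrete, here is a minimal sketch of how a multi-threaded chat could be represented and walked back through its reply-to links. The `Turn` fields and the `thread_of` helper are illustrative assumptions for exposition, not PINGPONG’s actual data schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    """One chat message; field names are illustrative, not PINGPONG's schema."""
    turn_id: int
    speaker: str
    text: str
    reply_to: Optional[int] = None   # turn_id of an earlier message, if any

def thread_of(turns: List[Turn], turn_id: int) -> List[Turn]:
    """Follow reply_to links backwards to recover the mini-thread a turn belongs to."""
    by_id = {t.turn_id: t for t in turns}
    chain, current = [], by_id.get(turn_id)
    while current is not None:
        chain.append(current)
        current = by_id.get(current.reply_to) if current.reply_to is not None else None
    return list(reversed(chain))

chat = [
    Turn(12, "B", "Snack list: chips, fruit, water"),
    Turn(30, "C", "Bus leaves at 7, jangan telat ya"),
    Turn(49, "A", "Re your #12 snack list, I'll bring chips", reply_to=12),
]
print([t.turn_id for t in thread_of(chat, 49)])  # [12, 49]
```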
PINGPONG recognizes that the world’s multilingual majority deserves AI that understands their natural way of speaking. It brings authentic, diverse, long, and tangled chats together with tasks that truly test understanding, not just word matching.
02 Core Idea
🍞 Hook: Imagine a giant photo album of real group chats where people freely switch languages, jump between threads, and talk like they do every day. Now imagine testing AI on that.
The “Aha!” in one sentence: Build a natural, human-written, multi-party, code-switching dialogue benchmark (PINGPONG) with multiple languages and three tasks, so we can honestly measure and improve how AI handles real multilingual chats.
Three analogies:
- Ping-pong table: Messages bounce back and forth, sometimes reaching back to an earlier serve (old message). PINGPONG tests if AI can keep its eye on the ball across languages.
- City traffic: Multiple lanes (topics) and vehicles (speakers) merging and switching lanes (languages). PINGPONG checks if AI can navigate without crashing.
- Library index: A messy scrapbook of conversations needs a smart indexer (AI) to find facts, summarize stories, and label topics, even when the language switches mid-sentence.
Before vs After:
- Before: Datasets were often short, synthetic, bilingual-only, and too tidy. Models looked good on paper but failed on real chats.
- After: With PINGPONG’s long, human-authored, multi-party, sometimes trilingual chats and three tasks, we see true limits—and real progress when fixes work.
Why it works (intuition):
- Human-authored chats are naturally uneven: variable message lengths, speaker dominance, long-distance replies, and spontaneous switches. These features create the exact challenges models face in the wild.
- Multiple tasks (QA, summarization, topic classification) ensure models can retrieve facts, see the big picture, and categorize content—three complementary lenses on real understanding.
- Diverse languages—including low-resource and mixed-script pairs—expose weaknesses that broad multilingual models often hide.
Building blocks:
- Human crowdsourcing: Native speakers in five language mixes chat for 15 minutes on assigned topics in Discord, using natural code-switching and reply-to features.
- Rich language coverage: Indonesian–English; Sundanese–Indonesian–English; Javanese–Indonesian–English; Hausa–English; Algerian Arabic–Standard Arabic–French.
- Three tasks: Multiple-choice QA (with answerable and clever unanswerable types), 3–5 sentence summaries, and topic labels (Science/Tech, Entertainment, Social/Culture, Education, Daily Life).
- Structure metrics: Code-Mixing Index (CMI) for how much mixing; Switch Point Fraction (SPF) for how often switching happens.
- Fair testing: Standard prompts, consistent formats (JSON outputs), and comparisons across open and proprietary models, with and without reasoning traces.
🥬 Concept 7 — Trilingual Conversations
- What it is: Chats mixing three languages.
- How it works: 1) Speakers choose whichever of the three languages fits best, 2) Switch fluidly at different levels, 3) Keep meaning across scripts and vocab.
- Why it matters: Without handling three languages, AI drops key details and makes wrong links. 🍞 Anchor: “Plan besok: petit déjeuner at 8, terus lanjut meeting jam 10, finish by dhuhr” (French–Indonesian–Arabic elements).
🥬 Concept 8 — Code-Mixing Index (CMI)
- What it is: A number showing how intensely languages are mixed in an utterance.
- How it works: 1) Count words by language, 2) See how balanced the mix is, 3) Higher CMI means heavier mixing.
- Why it matters: Without measuring mixing, we can’t compare how hard different chats are for AI. 🍞 Anchor: A sentence with half English, half Indonesian has higher CMI than one that’s 95% one language.
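As a concrete illustration, here is a small sketch of the widely cited Das & Gambäck (2014) formulation of CMI, assuming each token has already been tagged with a language; the paper’s exact implementation and language tagger may differ.

```python
from collections import Counter

def code_mixing_index(token_langs):
    """CMI for one utterance, following the common Das & Gambäck (2014) formula:
    100 * (1 - max_lang_tokens / (N - U)), where N is all tokens and U is the
    number of language-independent tokens (tagged "other": names, numbers, emoji).
    """
    n = len(token_langs)
    lang_counts = Counter(t for t in token_langs if t != "other")
    u = n - sum(lang_counts.values())       # language-independent tokens
    if not lang_counts or n == u:
        return 0.0
    max_lang = max(lang_counts.values())    # tokens of the dominant language
    return 100.0 * (1.0 - max_lang / (n - u))

print(code_mixing_index(["en", "en", "id", "id"]))   # 50.0 -> heavily mixed
print(code_mixing_index(["en"] * 19 + ["id"]))       # 5.0  -> nearly monolingual
```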
🥬 Concept 9 — Switch Point Fraction (SPF)
- What it is: The fraction of places where a language switch actually happens.
- How it works: 1) Look between words, 2) Count switch points, 3) Divide by all possible switch positions.
- Why it matters: Without knowing how often switching happens, we miss a key part of difficulty. 🍞 Anchor: In “Aku’ll send the file nanti,” the language flips inside “Aku’ll” (Indonesian “aku” + English “’ll”) and again between “file” and “nanti,” giving several switch points in one short message (see the sketch below).
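And a matching sketch for SPF, under one common definition (switches between adjacent language-tagged tokens divided by the number of token boundaries); again, the benchmark’s exact computation may differ.

```python
def switch_point_fraction(token_langs):
    """Fraction of adjacent token boundaries where the language changes.
    Language-independent tokens (tagged "other") are skipped before counting.
    """
    langs = [t for t in token_langs if t != "other"]
    if len(langs) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(langs, langs[1:]))
    return switches / (len(langs) - 1)

# "Aku'll send the file nanti" -> id, en, en, en, id: two switches over four boundaries
print(switch_point_fraction(["id", "en", "en", "en", "id"]))  # 0.5
```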
In short, PINGPONG makes the test feel like real life. It doesn’t just check if models know words—it checks if they can follow people as they truly talk.
03 Methodology
At a high level: Input (human group chats in five language mixes) → Collect multi-party, code-switched dialogues → Add three task annotations (QA, summaries, topics) → Compute structure metrics (CMI, SPF) → Evaluate many models with shared prompts (with and without reasoning, zero- and few-shot) → Output scores that reveal strengths and weaknesses.
Step-by-step:
- Recruit native speakers and form chat groups
- What happens: For each language mix, a language champion recruits annotators. People are grouped into 2-, 3-, or 4-person teams on Discord. Each group gets a topic.
- Why this step: Native speakers produce natural mixing that machines can’t fake well.
- Example: A 4-person Indonesian–English chat about “smartphones in school” runs for ~15 minutes.
- Collect natural multi-party dialogues
- What happens: Teams chat for 15 minutes, freely mixing languages at word/sentence/paragraph levels. They use Discord’s reply feature to respond to older messages and @-mentions to keep identities anonymous.
- Why this step: Real chats have uneven speaker dominance, variable message lengths, and multi-threaded replies—crucial challenges for AI.
- Example: One speaker sends three quick messages; another replies to message #12 while the chat is at #49.
- Create Question Answering (QA) items
- What happens: For each dialogue, annotators write up to 10 MCQs (up to 5 answerable, 5 unanswerable). Answerable ones require reasoning and sometimes external knowledge; unanswerable ones follow categories like Negation, Antonym, Entity-Swap, Mutual-Exclusion, and Impossible Condition. Option E is “No correct answer.” Questions are written in a designated L1 for that language mix.
- Why this step: QA checks precise understanding and reasoning, not just surface pattern matching.
- Example: Dialogue about AI in classrooms; question asks, “Which policy best reduces cheating with AI?” Choices look plausible, but only one aligns with the chat’s clues.
🥬 Concept 10 — Question Answering (QA)
- What it is: The AI answers questions about a given dialogue.
- How it works: 1) Read chat, 2) Find relevant info, 3) Reason, 4) Pick the best option.
- Why it matters: Without accurate QA, AI can’t help users get facts from long chats. 🍞 Anchor: From a 100-turn chat about a school event, “What time will the bus leave?”
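For intuition, one answerable MCQ item could look roughly like the structure below. The field names and example values are hypothetical, made up here for exposition; in the real dataset, questions are written in the designated L1 of each language mix.

```python
# Illustrative shape of a multiple-choice QA item (not the dataset's actual schema).
qa_item = {
    "dialogue_id": "id-en-0042",
    "question": "Which policy did the group agree on to reduce cheating with AI?",
    "options": {
        "A": "Ban all devices in class",
        "B": "Silent mode during lectures, plus a one-week trial policy",
        "C": "Allow AI tools only during exams",
        "D": "Remove homework entirely",
        "E": "No correct answer",
    },
    "answer": "B",
    "answerable": True,
    "unanswerable_type": None,  # e.g. "Negation" or "Entity-Swap" for trick questions
}
```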
- Write 3–5 sentence dialogue summaries
- What happens: Different annotators summarize each chat in the designated L1. They aim for Coherence, Fluency, Relevance, and Consistency.
- Why this step: Summarization checks whether AI gets the big picture and main points.
- Example: “The group debated phone use in class, agreed on silent mode during lectures, and planned to test a trial policy next week.”
🥬 Concept 11 — Dialogue Summarization
- What it is: Making a short, accurate summary of a long chat.
- How it works: 1) Identify key points, 2) Keep it factual, 3) Write clearly.
- Why it matters: Without good summaries, users drown in long threads. 🍞 Anchor: A 3-sentence recap that a teacher can read in 10 seconds.
- Assign topic labels
- What happens: Each chat is mapped to one label: Science/Tech, Entertainment, Social/Culture, Education, or Daily Life.
- Why this step: Topic classification tests broad understanding and helps organize chats.
- Example: A chat about language customs is labeled Social/Culture.
🥬 Concept 12 — Topic Classification
- What it is: Choosing the best category for a chat.
- How it works: 1) Read clues, 2) Match to known categories, 3) Output one label.
- Why it matters: Without labeling, it’s hard to search or sort thousands of chats. 🍞 Anchor: A soccer fandom chat → Entertainment.
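A single-label classification prompt over the five categories might be assembled like this minimal sketch; the wording is illustrative, not the benchmark’s actual prompt.

```python
TOPIC_LABELS = ["Science/Tech", "Entertainment", "Social/Culture", "Education", "Daily Life"]

def topic_prompt(dialogue: str) -> str:
    """Illustrative single-label topic prompt (not the benchmark's exact wording)."""
    return (
        "Read the code-switched group chat below and choose exactly one topic label from: "
        + ", ".join(TOPIC_LABELS) + ".\n"
        'Respond in JSON: {"label": "<label>", "explanation": "<one sentence>"}\n\n'
        "Dialogue:\n" + dialogue
    )

print(topic_prompt("A: Udah nonton episode baru?\nB: Yes! The plot twist was wild..."))
```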
- Measure linguistic complexity (CMI and SPF)
- What happens: Compute CMI (how mixed) and SPF (how often switches occur) for each dialogue.
- Why this step: These metrics reveal how challenging a chat might be for AI.
- Example: AR–DZ–FR shows high CMI; SU–ID–EN shows high SPF, signaling frequent switching.
- Prepare evaluation prompts and formats
- What happens: Build shared prompts for all models. For QA: 0-shot and 1-shot. For summarization: 0, 1, and 3-shot. For topics: 1-shot with an explanation. Outputs must be JSON for consistent parsing.
- Why this step: Fair, apples-to-apples comparisons.
- Example: “Answer with A/B/C/D/E in JSON. If no option is correct, choose E.”
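Here is a hedged sketch of what a shared zero-shot QA prompt plus strict JSON parsing could look like; the template text and the `parse_answer` helper are assumptions for illustration, not the benchmark’s released prompts.

```python
import json

QA_PROMPT = """You are given a multi-party, code-switched group chat.

Dialogue:
{dialogue}

Question: {question}
Options:
{options}

Answer with exactly one letter A/B/C/D/E. If no option is correct, choose E.
Respond only in JSON: {{"answer": "<letter>"}}"""

def parse_answer(model_output: str) -> str:
    """Extract the predicted letter; anything unparseable is simply scored as wrong."""
    try:
        letter = str(json.loads(model_output).get("answer", "")).strip().upper()
    except (json.JSONDecodeError, AttributeError):
        return "INVALID"
    return letter if letter in {"A", "B", "C", "D", "E"} else "INVALID"

print(parse_answer('{"answer": "b"}'))   # B
print(parse_answer("The answer is B"))   # INVALID
```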
🥬 Concept 13 — Few-Shot Learning (In-Context Examples)
- What it is: Giving the model a few examples in the prompt before it tries the real task.
- How it works: 1) Show 1–3 examples, 2) Let the model copy the pattern, 3) Then answer the target.
- Why it matters: Without examples, some models don’t know the expected style or format. 🍞 Anchor: Show a sample summary before asking for a new one.
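For example, a few-shot summarization prompt can be built by prepending a handful of worked dialogue–summary pairs before the target chat; the sketch below shows the general pattern, not the paper’s exact prompt wording.

```python
def build_few_shot_prompt(examples, target_dialogue, k=3):
    """Prepend up to k worked (dialogue, summary) pairs before the target dialogue.
    examples: list of (dialogue, summary) string pairs used as in-context demonstrations.
    """
    parts = []
    for dialogue, summary in examples[:k]:
        parts.append(f"Dialogue:\n{dialogue}\nSummary:\n{summary}\n")
    parts.append(f"Dialogue:\n{target_dialogue}\nSummary:")
    return "\n".join(parts)
```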
- Evaluate many models (global vs regional; with/without reasoning)
- What happens: Test instruction-tuned multilingual models (e.g., Qwen, Gemma, Aya), and regional models (e.g., Sailor2, Sahabat-AI, ALLAM, SILMA). Compare versions that generate explicit reasoning traces vs those that don’t.
- Why this step: Understand what helps most—scale, specialization, or reasoning.
- Example: Qwen3 with “thinking traces” vs Qwen2.5 without them.
🥬 Concept 14 — Reasoning / Thinking Trace
- What it is: The model writes its step-by-step thinking before the final answer.
- How it works: 1) The prompt asks for reasoning, 2) The model drafts a chain of thoughts, 3) It checks consistency, 4) Then gives the answer.
- Why it matters: Without reasoning, models guess more and miss tricky cases. 🍞 Anchor: For an unanswerable QA, the trace helps the model conclude “E: No correct answer.”
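In practice, the “with reasoning” and “without reasoning” settings often differ only in the instruction and in whether a model’s built-in thinking mode is enabled (as with Qwen3). The snippet below is an illustrative sketch of the two instruction styles, not the benchmark’s actual prompts.

```python
def qa_instruction(with_reasoning: bool) -> str:
    """Two evaluation modes with illustrative wording; real 'thinking' behaviour may
    also come from a model-side toggle rather than from prompt text alone.
    """
    answer_format = 'Then give your final answer as JSON: {"answer": "<A/B/C/D/E>"}.'
    if with_reasoning:
        return (
            "Reason step by step: track who said what and in which language, and check "
            "whether each option is actually supported by the dialogue. " + answer_format
        )
    return "Do not explain your reasoning. " + answer_format
```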
- Score with task-appropriate metrics
- QA & Topics → Accuracy (percent correct). Summarization → ROUGE-L (plus METEOR, chrF++, BERTScore) for overlap and quality.
🥬 Concept 15 — Accuracy
- What it is: The percent of answers that are correct.
- How it works: 1) Count correct predictions, 2) Divide by total questions, 3) Higher is better.
- Why it matters: Without accuracy, we can’t tell if QA or topic predictions are reliable. 🍞 Anchor: 60 correct out of 100 → 60% accuracy.
🥬 Concept 16 — ROUGE-L (for summaries)
- What it is: A score that checks how much a summary overlaps with a reference, focusing on longest matching sequences.
- How it works: 1) Compare generated vs reference texts, 2) Find long matching strings, 3) Compute an overlap score.
- Why it matters: Without a summary quality measure, we can’t track improvements. 🍞 Anchor: A summary that reuses key phrases and order from the reference earns higher ROUGE-L.
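The two headline metrics are straightforward to compute; here is a small sketch using the `rouge-score` package (the paper also reports METEOR, chrF++, and BERTScore, and may use different tooling).

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def accuracy(predictions, gold):
    """Percent of exact matches, used for QA and topic classification."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

print(round(accuracy(["B", "E", "A"], ["B", "C", "A"]), 1))  # 66.7

# ROUGE-L rewards long in-order overlaps (longest common subsequence) with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
score = scorer.score(
    "The group agreed on silent mode during lectures and a trial policy next week.",
    "They agreed phones go on silent during lectures and will trial the policy next week.",
)
print(round(score["rougeL"].fmeasure, 2))
```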
Secret sauce: Authenticity + structure. Real human chats with reply-to links, mixed scripts, variable lengths, long contexts, and tricky QA (including unanswerables) reveal what models truly can—or cannot—do. JSON outputs and shared prompts make fair comparisons easy.
04 Experiments & Results
The test: Can models understand and generate across human, multi-party, code-switched dialogues? We measured three abilities: answering reasoning-heavy and trick QA; summarizing long, tangled chats; and labeling topics correctly.
The competition: We compared multilingual general models (e.g., Qwen2.5/3, Gemma2/3, Aya23) and regional models tailored to local languages (e.g., Sailor2, Sahabat-AI, ALLAM, SILMA). We also toggled reasoning traces and varied the number of in-context examples (zero- vs. few-shot) to see what helps.
Scoreboard with context:
- Overall difficulty: Many models scored low on QA and topic classification, showing that PINGPONG is a tough, realistic test—more like a hard final exam when most benchmarks feel like easy quizzes.
- Regional advantage: Regionally tuned models (like Sailor2, Sahabat-AI) often beat general multilingual models on relevant languages, meaning local knowledge matters.
- Reasoning helps: Turning on “thinking traces” generally improved results, especially for answerable QA and sometimes for unanswerable QA (notably in Qwen3). It’s like showing your work in math—fewer careless mistakes.
- Few-shot effects: Adding a few examples didn’t consistently help QA or topic labels, but it did help summarization quality (higher ROUGE-L and related metrics), suggesting examples mainly teach output style and structure.
- Language mix challenge: Trilingual and mixed-script sets (like AR–DZ–FR) remained especially hard, reflecting real-world difficulty.
Concrete highlights (simplified):
- QA: Even strong models struggled, but enabling reasoning in Qwen3 and using regional models improved accuracy. Some language pairs (like SU–ID–EN) saw better relative scores, yet others (like AR–DZ–FR) stayed tough.
- Summarization: ROUGE-L improved with few-shot examples across many models; regional models often led in their regions; reasoning sometimes added small gains.
- Topics: Accuracy lagged behind expectations for many general models, but regional ones often did better in their home turf.
Surprising findings:
- Synthetic chats looked “too perfect” and led to overly optimistic expectations; real chats exposed bigger weaknesses.
- Unanswerable QA was uniquely challenging—models improved most when they had explicit reasoning traces to detect “no correct answer.”
- Human annotators strongly preferred human-written chats over GPT-4o-generated ones for naturalness, confirming the dataset’s authenticity goal.
Takeaway: PINGPONG reliably reveals the gap between today’s LLMs and the messy brilliance of real, multilingual group conversations—and shows which levers (reasoning, regional tuning) help close it.
05 Discussion & Limitations
Limitations:
- Language scope: Five language combinations across three regions is strong but not exhaustive. Many code-switching communities remain untested.
- Metrics: CMI and SPF are informative but simplified for under-resourced languages due to tool limitations; finer-grained measures could sharpen insights.
- Non-parallel dialogues: Because conversations differ by topic and content, direct language-to-language difficulty comparisons are tricky.
- Safety/noise: Despite guidance, some slang or edge cases may slip through. Ongoing curation is needed.
Required resources:
- People: Native speakers to collect chats and annotate QA, summaries, and topics.
- Platforms: A chat space (e.g., Discord) with reply-to and anonymization features; an annotation platform.
- Compute: To run evaluations across multiple models with varying prompt settings, plus data storage for long chats.
When NOT to use:
- Single-language, formal documents: If your use case is clean monolingual text, simpler benchmarks fit better.
- Ultra-short exchanges: PINGPONG shines on long, multi-threaded chats; tiny snippets won’t stress its strengths.
- Training data replacement: It’s a benchmark for evaluation first; don’t overfit your model to it.
Open questions:
- How to scale to more languages (e.g., more African, Southeast Asian, and European code-switching pairs) while keeping authenticity and fairness?
- Can improved reasoning or retrieval help with unanswerable QA detection and mixed-script parsing?
- What training interventions (curriculum learning, targeted pretraining, synthetic-then-human mixing) best transfer to real chats?
- How to better evaluate cross-turn, cross-thread coherence beyond ROUGE-L—new metrics for multi-party discourse?
06 Conclusion & Future Work
Three-sentence summary: PINGPONG is a human-authored benchmark of multi-party, code-switched dialogues across five language combinations, paired with QA, summarization, and topic tasks. It captures real conversational messiness—long contexts, uneven speakers, and replies to older turns—that synthetic data misses. Evaluations show today’s models often struggle, but reasoning traces and regional tuning help.
Main achievement: Making a natural, structurally rich, open benchmark that honestly tests whether AI can follow the real way multilingual people chat.
Future directions: Expand to more languages and scripts; design new metrics for multi-thread coherence; explore training strategies (reasoning-augmented, retrieval-augmented, regionally adapted) that boost code-switching performance without overfitting. Deeper study of unanswerable QA and mixed-script challenges can guide targeted improvements.
Why remember this: Most of the world code-switches every day. PINGPONG centers their reality, giving us a clear, fair way to build AI that truly listens—and responds—across languages, threads, and people.
Practical Applications
- Evaluate chatbots for multilingual communities before deployment.
- Tune customer-support bots to handle mixed-language tickets and follow long threads.
- Build school tools that summarize bilingual class chats or parent–teacher groups.
- Improve content moderation by understanding mixed-language slang and context.
- Help newsrooms auto-summarize community forums with code-switching.
- Assist health hotlines that receive multilingual, overlapping messages.
- Search and organize group chats by topic across languages.
- Stress-test reasoning features (thinking traces) on unanswerable questions.
- Benchmark regional models for local markets and languages.
- Guide data collection for future multilingual training with realistic chat patterns.