No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Intermediate
Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi et al. · 2/3/2026
arXiv · PDF

Key Summary

  • This paper builds ID-MoCQA, a new two-step (multi-hop) quiz set about Indonesian culture that makes AI connect clues before answering.
  • It turns simple one-step questions into harder two-step ones using six clue types, such as time, place, and commonsense hints.
  • The dataset is bilingual (Indonesian and English) and contains 15,590 carefully checked questions.
  • A strict quality pipeline mixes human checking with "LLM-as-a-judge" filtering to catch mistakes and keep questions fair and clear.
  • Frontier AI models beat the human baseline overall, but they still struggle when the correct answer depends on subtle cultural context.
  • Models are good at guessing the province from clues, but often miss the final culturally appropriate choice for the situation.
  • Adding "think step by step" (Chain-of-Thought) prompting helps some models a little, but not reliably across clue types or languages.
  • Smaller or region-tuned models do well on simple single-hop tests but drop sharply on these multi-hop cultural reasoning tasks.
  • The study exposes a bias where AIs pick the most famous tradition instead of the right practice for the exact context.
  • ID-MoCQA offers a benchmark for teaching and testing AIs to reason respectfully and accurately about culture.

Why This Research Matters

When AI understands culture as more than trivia, it can give advice that fits real people’s lives. This reduces awkward or disrespectful suggestions and helps in sensitive areas like weddings, funerals, or health practices. A two-step test like ID-MoCQA makes sure AI connects place and situation before choosing an action. That builds trust in education, travel, customer support, and public services. It also helps include underrepresented regions so AI serves everyone more fairly. Finally, by exposing biases (like picking the most famous tradition), it guides better training to make AI kinder and smarter about culture.

Detailed Explanation

01Background & Problem Definition

🍞 Top Bread (Hook) Imagine you’re planning a trip across the Indonesian islands. You don’t just need trivia like “Batik is a cloth.” You need to know who wears what, when it’s polite to give a certain gift, and which celebration happens where. That’s more than facts—it’s fitting your actions to the situation.

🥬 Filling (The Actual Concept)

  • What it is: Cultural knowledge is knowing not only facts about traditions but when, where, and how they are actually used in real life.
  • How it works: You identify the place or group → read the social situation → choose the fitting practice (not just the most famous one).
  • Why it matters: Without this, an assistant might give advice that sounds right on paper but feels wrong or disrespectful in context.

🍞 Bottom Bread (Anchor) Example: Knowing “ulos” is a North Sumatran cloth is a fact; knowing when it’s a suitable gift for a wedding versus a casual visit is cultural understanding.

🍞 Top Bread (Hook) You know how a true-or-false quiz is faster than a mystery puzzle? Many old AI tests about culture were like quick quizzes.

🥬 Filling (The Actual Concept)

  • What it is: Single-hop question answering is when a question can be answered using one direct clue or fact.
  • How it works: Read the question → recall one matching fact → pick the answer.
  • Why it matters: If we only test this way, models can guess using shortcuts (like keyword spotting) without really understanding culture.

🍞 Bottom Bread (Anchor) Question: “Which province is famous for Tor-tor?” That’s one hop: North Sumatra.

🍞 Top Bread (Hook) Think of culture like a recipe: the same ingredient can be used differently depending on the family, the event, or the season.

🥬 Filling (The Actual Concept)

  • What it is: Contextual reasoning means using the situation (who, when, where, why) to choose the right cultural practice.
  • How it works: Spot context clues → connect them to the right place/group → decide what’s appropriate for this moment.
  • Why it matters: Without it, an AI might suggest a grand ceremonial dish for a simple everyday meal.

🍞 Bottom Bread (Anchor) If the question says “casual dinner,” picking a festival-only dish is wrong even if that dish is regionally famous.

🍞 Top Bread (Hook) If two foods both come from Indonesia, one might still be unique to a specific province.

🥬 Filling (The Actual Concept)

  • What it is: Cultural specificity means some practices, items, or customs strongly belong to one region or group.
  • How it works: Find the marker (like a dance, law, or landmark) → map it to its unique province → use that to guide the next choice.
  • Why it matters: Without specific anchors, models confuse similar regions (like Java’s provinces) that share many traditions.

🍞 Bottom Bread (Anchor) “Tor-tor” points to North Sumatra; “Sharia qanun and Wilayatul Hisbah” points to Aceh.

🍞 Top Bread (Hook) You’ve met chatbots that can talk about many topics. They are powered by big pattern-learning machines.

🥬 Filling (The Actual Concept)

  • What it is: Large Language Models (LLMs) are AI systems trained on lots of text to predict and produce helpful language.
  • How it works: Read the input → match it to patterns learned during training → generate the next words.
  • Why it matters: LLMs can store many facts, but we need to check if they can reason about culture, not just recite it.

🍞 Bottom Bread (Anchor) An LLM might know “Bali has Hindu temples,” but does it pick the right Bali practice for a newborn ritual?

The World Before Before this work, cultural QA datasets mostly used single-hop questions. They checked if models knew simple facts (like names, foods, or dances) but not whether they could connect clues and pick the suitable action for a given scenario. That left a big gap between “knowing trivia” and “acting appropriately.”

The Problem Models could look smart by spotting surface hints without truly reasoning. Culture, however, is layered: you often must first figure out the place, time, or community, and only then choose what fits.

Failed Attempts

  • Relying on single-hop tests: encourages shortcut guessing.
  • Direct translation to one language: can lose local terms and nuance.
  • Unchecked auto-generation: can introduce factual errors, especially for comparisons or complex intersections of clues.

The Gap We needed a benchmark that forces two steps: identify the cultural context first, then select the appropriate practice. And it must be bilingual, faithful to local terms, and strictly validated.

Real Stakes

  • Everyday respect: Helping travelers, teachers, or health workers avoid insensitive suggestions.
  • Fair AI: Reducing stereotypes like “traditional equals communal” or “patriarchal norms everywhere.”
  • Inclusion: Spotlighting underrepresented regions (e.g., Papua, Aceh) so AI serves everyone, not just the majority.

02Core Idea

🍞 Top Bread (Hook) Imagine a treasure hunt where the first clue tells you the island, and the second clue tells you the exact hidden spot. You must solve the first to solve the second.

🥬 Filling (The Actual Concept)

  • What it is: The paper’s key idea is to turn simple cultural questions into two-step (multi-hop) puzzles so AIs must connect clues before answering.
  • How it works: Start from high-quality single-hop questions → add a first-hop cultural clue that uniquely points to a province → keep the original cultural choice as the second hop → validate at scale (humans + LLM-as-a-judge) in Indonesian and English.
  • Why it matters: It blocks easy shortcuts and tests real cultural reasoning—context first, then the proper practice.

🍞 Bottom Bread (Anchor) Question becomes: “What cloth should she buy in the region where Tor-tor is danced at important ceremonies?” Step 1: Tor-tor → North Sumatra. Step 2: Pick ulos as the cloth.

Multiple Analogies

  • Detective story: First find the city from a street clue, then find the exact shop that sells the rare item.
  • Cooking show: First identify the cuisine (province) from spices, then choose the right dish for a birthday.
  • Sports playoffs: First figure out which stadium the game is in, then pick the strategy that fits that team’s style.

Before vs After

  • Before: One-step answers let models grab famous facts and guess.
  • After: Two-step chains make models prove they know the place and the practice.

Why It Works (Intuition)

  • Forcing the province hop anchors the model to the right cultural space (reduces confusion across similar regions).
  • Keeping the original IndoCulture options preserves authentic, well-formed cultural choices.
  • Six clue types diversify reasoning (entity, geography, time, commonsense scenario, comparison, intersection), so models can’t memorize one trick.
  • Bilingual questions keep local terms intact and test cross-language reasoning.
  • A multi-stage validation pipeline raises question quality and fairness.

Building Blocks (with Sandwich blocks for key pieces)

🍞 Top Bread (Hook) You know how a board game often needs two moves: position your piece, then make the winning play?

🥬 Filling (The Actual Concept)

  • What it is: Two-hop reasoning is a mini chain: find the province first, answer the cultural question second.
  • How it works: Use clue → map to one province → use province to select the right option.
  • Why it matters: Prevents answering without context.

🍞 Bottom Bread (Anchor) Clue: “largest Buddhist temple and active volcanoes” → Central Java → choose the matching celebration food.

🍞 Top Bread (Hook) Imagine packing your backpack with different tools: a compass, a map, a calendar, a guidebook.

🥬 Filling (The Actual Concept)

  • What it is: Clue types are six styles of hints—commonsense, comparison, entity, geographical, intersection, temporal—that point to the province in different ways.
  • How it works: Each question uses one clue type to identify the province, then asks the original cultural question.
  • Why it matters: Variety stops overfitting to one trick and checks wider cultural reasoning.

🍞 Bottom Bread (Anchor) Entity clue: “Where Cut Nyak Dhien led resistance…” → Aceh → pick the Aceh beverage.

🍞 Top Bread (Hook) Like teachers cross-check each other’s grading to be fair, we need a referee for thousands of questions.

🥬 Filling (The Actual Concept)

  • What it is: LLM-as-a-judge uses strong models to score question quality (accuracy, clarity, structure, language) at scale.
  • How it works: Multiple judges rate each question → combine votes → filter or fix items.
  • Why it matters: Full human review of 22k+ items is too slow; this keeps quality high efficiently.

🍞 Bottom Bread (Anchor) If any judge flags “Significant” issues, the question is rejected. Majority “Acceptable” keeps it.
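The acceptance rule described above (reject on any "Significant" flag, keep on majority "Acceptable") can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the rating labels and helper name are assumptions.

```python
def keep_question(ratings: list[str]) -> bool:
    """Decide whether to keep a generated question.

    ratings: one quality label per LLM judge,
    e.g. ["Acceptable", "Minor", "Significant"].
    """
    if "Significant" in ratings:          # any single Significant flag rejects
        return False
    acceptable = sum(r == "Acceptable" for r in ratings)
    return acceptable > len(ratings) / 2  # strict majority of Acceptable keeps it

print(keep_question(["Acceptable", "Acceptable", "Minor"]))   # True
print(keep_question(["Acceptable", "Minor", "Significant"]))  # False
```

With three judges this means at least two must rate a question Acceptable, and none may flag a Significant issue.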

Result: ID-MoCQA—15,590 bilingual, validated two-hop cultural questions for Indonesia that demand real cultural reasoning.

03Methodology

High-level Flow: Input (IndoCulture single-hop) → Step A: Add first-hop cultural clue (six types) → Step B: Build bilingual two-hop question → Step C: Multi-stage validation (humans + LLM judges + structure checks + language rebalance + naturalness/difficulty) → Output: ID-MoCQA.

Step-by-step (with Sandwich explanations for new pieces)

  1. Collect base single-hop QAs
  • What happens: Start with IndoCulture questions labeled as province-specific (True). These are strong anchors because their cultural elements uniquely belong to one province (e.g., Tor-tor → North Sumatra).
  • Why this step exists: We need reliable seeds so the first hop (province) is meaningful.
  • Example: “Mrs. Gabe wants a traditional cloth. Options: ulos, koffo, lantung.” Province tag: North Sumatra.

🍞 Top Bread (Hook) Like turning a simple road sign into a puzzle path with a clue before the sign.

🥬 Filling (The Actual Concept)

  • What it is: Two-hop expansion creates a province-identifying clue before the original question.
  • How it works: Remove province mentions → add an indirect clue of one type (e.g., entity: Tor-tor) → keep original options → form one combined question.
  • Why it matters: It forces reasoning: clue first, choice second.

🍞 Bottom Bread (Anchor) “What cloth should Bu Gabe buy … in the region where Tor-tor is performed?” → First-hop: North Sumatra → Second-hop: ulos.
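The expansion step above can be pictured as a simple data transformation: take a province-specific seed item, strip the province mention, and prepend an indirect clue. The field names below are assumptions for illustration, not the dataset's actual schema.

```python
# A province-specific single-hop seed (IndoCulture-style), with the province
# tag that will be removed from the final question text.
seed = {
    "question": "Mrs. Gabe wants a traditional cloth. What should she buy?",
    "options": ["ulos", "koffo", "lantung"],
    "answer": "ulos",
    "province": "North Sumatra",
}

# An entity-type clue that uniquely identifies the province without naming it.
clue = "the region where Tor-tor is performed at important ceremonies"

two_hop = {
    "question": f"What cloth should Mrs. Gabe buy in {clue}?",
    "options": seed["options"],      # original cultural options preserved
    "answer": seed["answer"],        # second-hop gold answer
    "first_hop": seed["province"],   # gold label for province identification
    "clue_type": "entity",
}
print(two_hop["question"])
```

The same seed could be expanded with any of the six clue types; only the `clue` text and `clue_type` label change.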

  2. Design six clue types and prompts
  • What happens: Create strict templates for commonsense, comparison, entity, geographical, intersection, temporal. Include do’s/don’ts (e.g., no rivers that cross provinces; specify years for population stats; brief scenarios for commonsense).
  • Why this step exists: Clear templates reduce errors like ambiguity, answer leakage, or unverifiable claims.
  • Example data: “Entity” might use a historic figure (Cut Nyak Dhien → Aceh); “Geographical” might use Derawan Islands → specific province.
  3. Generate bilingual questions (Indonesian and English)
  • What happens: Use an LLM (Claude-3.7-Sonnet) to produce both versions at once, keeping culture terms (Rumoh Aceh, ulos) in original language.
  • Why this step exists: Indonesia is multilingual; testing both languages checks cross-lingual reasoning and preserves authenticity.
  • Example: English and Indonesian versions of the same two-hop question share identical cultural terms.

🍞 Top Bread (Hook) Imagine building a bridge, then shaking it to see if it’s sturdy.

🥬 Filling (The Actual Concept)

  • What it is: Multi-stage validation filters and fixes low-quality items.
  • How it works: a) Human spot-checking: sample 3,000 items; rate OK/Minor/Moderate/Significant. b) LLM-as-a-judge: three judges (GPT-4o/5 family, Claude-3.7-Sonnet, DeepSeek-V3) score factuality, structure, clarity, language; keep items with majority Acceptable and drop any with a single Significant. c) Structure verification: detect copied options or province name leaks; rewrite to remove shortcuts without losing terms. d) Language rebalance: fill missing language pairs to ensure both EN/ID exist. e) Naturalness and difficulty ratings: native speakers label Natural/Acceptable/Unnatural and Easy/Moderate/Hard; revise Unnatural items.
  • Why it matters: Auto-generation can drift; layered checks recover quality at scale.

🍞 Bottom Bread (Anchor) If a comparison claim lacks a year or unique ranking, it’s revised or removed to avoid misleading clues.
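One of the structure checks described above, detecting province-name leaks in the question text, is straightforward to sketch. This is an illustrative sketch of the idea, not the authors' implementation.

```python
def leaks_province(question: str, gold_province: str) -> bool:
    """Flag questions whose text names the first-hop province directly.

    Naming the province gives away the first hop, so flagged items are
    rewritten to use indirect clues instead.
    """
    return gold_province.lower() in question.lower()

print(leaks_province("What cloth should she buy in North Sumatra?",
                     "North Sumatra"))   # True  -> must be rewritten
print(leaks_province("What cloth should she buy in the region where "
                     "Tor-tor is performed?", "North Sumatra"))  # False
```

A production check would also catch options copied verbatim into the clue and near-matches (abbreviations, Indonesian vs. English province names), but the core idea is this substring test.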

  4. Dataset outcome and stats
  • Output: 15,590 two-hop questions, 7,795 per language. Distribution across 6 clue types and 12 cultural topics spanning 11 provinces.
  • Example: COMPARISON is the smallest because it’s hardest to verify and was most often filtered.

The Secret Sauce

  • Build from province-unique seeds to ensure a solid first hop.
  • Enforce six diverse clue styles to test different reasoning muscles.
  • Keep culture words un-translated to preserve meaning.
  • Combine human expertise and LLM judges for scalable, reliable quality.

🍞 Top Bread (Hook) Think of this like assembling a bicycle and then road-testing it on hills, flats, and bumpy roads.

🥬 Filling (The Actual Concept)

  • What it is: The pipeline is a recipe that systematically adds challenge, checks quality, and balances languages.
  • How it works: Seed (IndoCulture) → clue (six types) → bilingual generation → multi-stage validation → final release.
  • Why it matters: Each stage catches different problems so the final set truly measures cultural reasoning.

🍞 Bottom Bread (Anchor) Geographical clues avoid rivers that cross provinces; if a question slipped and named “Bali province,” it’s rewritten to use indirect hints instead.

04Experiments & Results

🍞 Top Bread (Hook) Imagine a school test where you must first name the city from hints and then pick the right custom for a special event in that city.

🥬 Filling (The Actual Concept)

  • What it is: The evaluation measures two-hop accuracy—first guessing the province from a clue, then selecting the culturally correct option.
  • How it works: Models respond in a strict format: PROVINCE: … ANSWER: A/B/C. We also compare models of different sizes and languages (EN and ID) and test Chain-of-Thought prompting.
  • Why it matters: This shows whether AIs can go beyond memorizing famous facts to making context-appropriate cultural choices.

🍞 Bottom Bread (Anchor) A model may identify “Aceh” correctly but still pick a ceremonial food that doesn’t fit a casual-dining scenario.
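The strict response format above ("PROVINCE: … ANSWER: A/B/C") makes scoring mechanical: parse the two fields, then score each hop. The parser and scoring below are a plausible sketch under that format, not the authors' evaluation code.

```python
import re

def parse_response(text: str):
    """Extract the predicted province and answer letter from a model response."""
    prov = re.search(r"PROVINCE:\s*(.+)", text)
    ans = re.search(r"ANSWER:\s*([A-C])", text)
    return (prov.group(1).strip() if prov else None,
            ans.group(1) if ans else None)

def score(pred_text: str, gold_province: str, gold_letter: str) -> dict:
    """Score first-hop (province) and final-answer correctness separately."""
    prov, ans = parse_response(pred_text)
    return {
        "first_hop": prov == gold_province,  # did it identify the province?
        "final": ans == gold_letter,         # did it pick the right option?
    }

s = score("PROVINCE: Aceh\nANSWER: B", "Aceh", "B")
print(s)  # {'first_hop': True, 'final': True}
```

Tracking the two hops separately is what reveals the paper's key gap: models can score above 96% on `first_hop` while landing 18–23 points lower on the final answer.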

The Test

  • Tasks: province identification (first hop) + final cultural answer (second hop).
  • Languages: English and Indonesian.
  • Models: Frontier (GPT-5, Claude-3.7-Sonnet, DeepSeek-V3), large open models (Llama, Qwen, Gemma), and region-specific small models (Merak-7B, SeaLLM-7B).
  • Human baseline: Three native speakers answered all 7,795 questions (per language); human multi-hop accuracy ~70%, with first-hop ~95%.

The Competition

  • Compared to humans: Frontier models surpass human multi-hop accuracy by more than 10 points in Indonesian, with a similarly strong margin in English.
  • Compared across clue types: No single type is universally easiest; strengths vary by model.

The Scoreboard (with context)

  • Frontier leaders: Claude-3.7-Sonnet and GPT-5 top the board around 81% overall; DeepSeek-V3 follows mid-to-high 70s. That’s like scoring A/A− when many others score B/C.
  • Smaller and region-tuned models: Merak-7B and SeaLLM-7B are strong on easier single-hop tests but drop to about 51% here, showing multi-hop cultural reasoning is much tougher.
  • First-hop vs second-hop: Frontier models exceed 96% on province prediction but are 18–23 points lower on full two-hop accuracy. This means they often find the right place but still miss the best culturally fitting choice.
  • Province familiarity effect: Humans do best on well-known regions (Bali, West/Central Java) but drop on less familiar ones (Papua, Aceh). Frontier models stay steadier, likely due to broader training data.

Chain-of-Thought (CoT) Results

  • Adding “Let’s think step by step” gives small average gains (around 1–3 percentage points), biggest for GPT-5. But CoT can also hurt in some model-language-type combinations, so it’s not a magic fix.

Surprising Findings

  • Fame vs fit: Models often pick the most famous tradition instead of the right one for the situation. For example, choosing a ceremonial Acehnese dish for a casual meal.
  • Majority patterns: In many failures, all three frontier models chose the same wrong answer, showing shared biases from similar training patterns.
  • Category quirks: COMPARISON questions were the most fragile to generate and validate, and models sometimes struggled more with them.

🍞 Top Bread (Hook) If you’ve ever mixed up cousins who look alike, you know why precise clues matter.

🥬 Filling (The Actual Concept)

  • What it is: Error analysis shows models confuse regions with shared cultural features (e.g., Javanese provinces) unless the clue is very specific.
  • How it works: Strong first-hop anchoring reduces mix-ups but doesn’t guarantee the right cultural action.
  • Why it matters: Getting the place right is necessary but not sufficient; models must also respect situational norms.

🍞 Bottom Bread (Anchor) Models correctly link “bakar batu” to Papua but then wrongly assume “communal sharing” fits the question, missing the real practice of selling pork by the kilogram at a local market.

05Discussion & Limitations

Limitations

  • Region scope: ID-MoCQA focuses on Indonesia; cultural reasoning in other countries may need their own datasets.
  • Generation difficulty: COMPARISON and INTERSECTION clues are hard to verify automatically; they needed heavy filtering and still remain tricky.
  • Biases remain: Models show a “prominent-practice” bias (choosing the most famous option) and sometimes apply majority religious or patriarchal patterns where local norms differ.
  • Partial automation: LLM-as-a-judge is helpful but not perfect; human validation is still needed for the toughest edge cases.

Required Resources

  • For building similar datasets: access to native experts, capable LLMs for generation and judging, and time for iterative validation.
  • For evaluation: compute to run multiple models across two languages and code to parse structured outputs.

When NOT to Use

  • If your task only needs simple fact lookup (single-hop), this benchmark is overkill.
  • If you need live, changing data (e.g., current events), note that ID-MoCQA is static and historically grounded.
  • If you can’t preserve local terms or bilingual structure, you may lose cultural precision.

Open Questions

  • How to reduce the “pick the famous thing” bias reliably?
  • What kinds of training (preference tuning, debiasing, retrieval) best improve situational cultural fit?
  • Can explanations (faithful CoT) help humans trust model choices for sensitive cultural advice?
  • How to scale this approach to many regions while keeping authenticity and verification strength?
  • What’s the best way to teach models differences among similar neighboring cultures (fine-grained disambiguation)?

06Conclusion & Future Work

3-Sentence Summary This paper introduces ID-MoCQA, a bilingual Indonesian cultural benchmark that forces two-step reasoning: first identify the province from a clue, then choose a culturally appropriate answer. It presents a full framework to convert single-hop questions into multi-hop ones across six clue types and validates them using both human annotators and LLM-as-a-judge. Experiments show frontier models beat human baselines overall but still stumble on nuanced cultural fit, revealing biases toward famous practices over situational correctness.

Main Achievement A robust, large-scale, carefully validated dataset and pipeline that make cultural reasoning unavoidable—no shortcuts—so we can really test and improve AI’s cultural competence.

Future Directions

  • Debias models against “prominent practice” shortcuts via preference tuning and targeted training.
  • Add explainable reasoning that highlights why an option fits a situation.
  • Expand to more regions and languages with equally strong validation.
  • Explore retrieval or knowledge editing to anchor subtle regional differences.

Why Remember This ID-MoCQA changes the question from “Does AI know facts?” to “Can AI act respectfully and appropriately in context?” That shift—from trivia to thoughtful behavior—is essential for building AI that supports people in real communities, with real traditions, in everyday life.

Practical Applications

  • Train chatbots for tourism to recommend region-appropriate gifts, foods, and etiquette.
  • Support teachers with accurate, context-aware cultural lessons across Indonesian provinces.
  • Improve healthcare communication tools to respect local customs around pregnancy and family.
  • Guide government service bots to use correct forms of address and regional practices.
  • Enhance e-commerce recommendations for culturally appropriate souvenirs by region and occasion.
  • Assist journalists and researchers in fact-checking region-specific cultural claims.
  • Upgrade virtual assistants for domestic workers and caregivers with appropriate household etiquette by region.
  • Help NGOs tailor community outreach messages to local norms and celebrations.
  • Improve translation/localization systems by preserving culture-specific terms and their contexts.
  • Benchmark and tune LLMs for fairer, bias-aware cultural reasoning beyond Indonesia.
#multi-hop question answering#cultural reasoning#Indonesian culture#LLM-as-a-judge#bilingual dataset#contextual reasoning#cultural competence#commonsense clues#comparison clues#entity clues#geographical clues#intersection clues#temporal clues#Chain-of-Thought#benchmarking LLMs