
The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

Intermediate
Yan Wang, Yitao Xu, Nanhan Shen et al. · 1/6/2026
arXiv · PDF

Key Summary

  ‱ Mixture-of-Experts (MoE) language models don’t split cleanly into domain specialists; instead, a small, stable group of experts gets chosen again and again across many subjects.
  ‱ The authors introduce COMMITTEEAUDIT, a careful after-the-fact (post hoc) checkup that looks at groups of experts, not just single ones, to see who really does the work.
  ‱ Across three different MoE models on the MMLU benchmark, the same compact “Standing Committee” of experts captures most of the routing mass in many layers and domains.
  ‱ Even models that already include always-on shared experts still form a Standing Committee among the routed experts, showing centralization is an emergent behavior.
  ‱ Metrics show strong sharing (Jaccard similarity ≈ 0.87) and extreme contribution inequality (Gini often > 0.9), meaning a few experts do most of the work.
  ‱ Qualitative checks suggest the Standing Committee handles reasoning structure and grammar, while fringe (peripheral) experts fetch domain facts like chemistry or law terms.
  ‱ Load-balancing losses that push for uniform expert use may fight against this natural centralization, potentially wasting compute and slowing training.
  ‱ The Standing Committee stays small (about 2–5 experts) yet can cover up to ~70% of the routing mass, even as the total number of experts grows.
  ‱ When the top-k routing budget changes a lot, committee membership can shuffle, but the overall centralization pattern remains.
  ‱ This work suggests future MoE training should embrace a core–periphery design instead of forcing every expert to be equally used.

Why This Research Matters

If only a tiny group of experts carries most of the work, we can train and tune models more efficiently by focusing on that core. Embracing a core–periphery design could improve accuracy and stability instead of wasting compute on forcing equal expert usage. System builders can monitor and protect the Standing Committee to make models more reliable and interpretable. Cloud costs and energy use can drop if we optimize for the few experts that matter most at inference. Finally, understanding that reasoning and syntax are centralized helps explain model behavior and guide safer, more targeted improvements.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine your school has clubs for math, art, science, and sports. You might think each club only does its own things. But what if, behind the scenes, the same small group of student leaders quietly helps every club run smoothly—setting rules, organizing schedules, and keeping things on track?

đŸ„Ź The Concept (Mixture-of-Experts, introduced below): Before this paper, most people believed MoE language models worked like totally separate clubs. Each “expert” would handle one domain—like a math expert for math, a law expert for law. The big hope: you could grow a model’s total brainpower without paying the full cost every time, because only a few experts would wake up per token (sparse routing). This sounded like divide-and-conquer: different types of questions go to different specialists.

How it worked in practice:

  1. Replace a Transformer’s feed-forward block with many experts.
  2. A small router picks top-k experts for each token.
  3. Only those experts compute, saving time and compute.
  4. Ideally, different domains use different experts.

Why it mattered: If true specialization happened, we’d get big, fast models that are also efficient. Like having a giant team where only the right people show up for each task.
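
To make the routing mechanics concrete, here is a minimal Python sketch of top-k gating. Everything in it is an illustrative assumption (toy sizes, random weights, a softmax gate); real MoE layers such as those in OLMoE or Qwen3 use trained routers and full feed-forward experts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; real MoE layers are far larger.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a random linear map; a real expert is a feed-forward block.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))  # the router's scoring matrix

def moe_forward(token_vec):
    """Route one token through a sparse MoE layer using top-k gating."""
    logits = token_vec @ router_w                 # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax -> routing weights
    winners = np.argsort(probs)[-top_k:]          # indices of the top-k experts
    gate = probs[winners] / probs[winners].sum()  # renormalize over the winners
    # Only the chosen experts compute; their outputs are mixed by the gate weights.
    out = sum(g * (token_vec @ experts[i]) for g, i in zip(gate, winners))
    return out, probs                             # keep the full distribution for auditing

output, routing_weights = moe_forward(rng.normal(size=d_model))
print("full routing distribution:", np.round(routing_weights, 3))
```

The detail that matters later: the router produces a full preference distribution over all experts, not just the top-k winners, and that full vector is exactly what COMMITTEEAUDIT inspects.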

🍞 Anchor: Think of asking questions in math class versus history class. Many imagined MoE would send math words to math experts and history words to history experts, like teachers in different classrooms.

What was the real-world situation? As MoEs got bigger and better, researchers noticed oddities. Sometimes experts didn’t specialize neatly. Some experts acted like generalists across many tasks. People also worried about “representation collapse,” where routing fails and many experts barely get used at all. To fix this, some designs (like DeepSeek) added “shared experts” that are always on, hoping to separate general knowledge from the routed specialists.

The problem: Even after adding shared experts, models still didn’t behave like tidy sets of specialists. A few routed experts kept showing up almost everywhere, again and again, across different subjects. It felt like there was a hidden central team doing most of the work.

Failed attempts and why they weren’t enough:

  • Counting single expert activations: Looked at who lights up most often, but missed how experts co-activate as a group.
  ‱ Forcing balance with load-balancing losses: Ensured more uniform usage, but may fight the model’s natural habit of centralizing core reasoning (a sketch of one standard balancing loss follows this list).
  • Declaring “super experts” by frequency: Spotted popular experts, but didn’t reveal stable coalitions that appear together across domains.
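
For the second bullet, here is a minimal example of one common load-balancing auxiliary loss, in the style of Switch Transformers. This is an assumed, generic formulation for illustration; the text above does not pin down which balancing loss any particular MoE uses.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """One common auxiliary balancing loss (Switch-Transformer style); an
    illustrative assumption, not a formulation taken from this paper.

    router_probs:       (n_tokens, n_experts) softmax outputs of the router
    expert_assignments: (n_tokens,) expert index each token was dispatched to
    """
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_assignments, minlength=n_experts) / len(expert_assignments)
    # P_i: average router probability mass given to expert i
    P = router_probs.mean(axis=0)
    # Minimized when both f and P are uniform, i.e. when every expert is used
    # equally often -- the uniformity that may fight natural centralization.
    return n_experts * np.sum(f * P)

rng = np.random.default_rng(1)
skewed = rng.dirichlet([8, 8, 1, 1, 1, 1, 1, 1], size=1000)  # routing mass piles onto experts 0-1
balanced = np.full((1000, 8), 1 / 8)
print("aux loss, centralized router:", round(load_balancing_loss(skewed, skewed.argmax(axis=1), 8), 3))
print("aux loss, balanced router:   ", round(load_balancing_loss(balanced, np.arange(1000) % 8, 8), 3))
```

The centralized router pays a visibly larger penalty than the balanced one, which is the tension the paper points to: the auxiliary loss pushes the optimizer away from the core–periphery structure the model seems to prefer.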

The gap: We lacked a group-level view. Were experts truly independent specialists, or did they form reliable coalitions that the router reused across many tasks?

The stakes (why daily life should care):

  • Training efficiency: If the model naturally funnels work to a small core, pushing uniformity may waste compute and slow learning.
  • Performance: Embracing a core–periphery design might boost accuracy and stability.
  • Reliability and interpretability: If a small “Standing Committee” anchors reasoning and syntax, we can monitor and improve that core more directly.
  • Deployment costs: Understanding which experts really matter can reduce energy usage and inference latency by focusing optimization where it counts.

Introducing the key ideas we’ll use to investigate:

  • We start with Mixture-of-Experts (MoE) so you know the parts.
  • Then we meet the Standing Committee: a tiny team of experts that show up across many domains.
  • Then COMMITTEEAUDIT: a careful, post-hoc audit that studies groups instead of single experts.
  • Along the way, we use three simple scoreboards—Expert Contribution Index (ECI) to measure who carries weight, Jaccard Similarity to see how much the same experts are reused across domains, and the Gini Coefficient to show how unequal the contributions are.

Together, these tools tell a new story: MoEs don’t just split into neat specialists—they grow a dependable, domain-invariant core that does a lot of the heavy lifting.

02Core Idea

🍞 Hook: You know how, in group projects, there’s always a small set of classmates who make the outline, set the plan, and keep everyone on track—no matter the subject? They aren’t the only contributors, but they shape the whole project.

đŸ„Ź The Aha! Moment (one sentence): In Mixture-of-Experts models, a compact, domain-invariant “Standing Committee” of experts consistently does most of the work across many subjects, while peripheral experts add domain details.

Multiple Analogies (3 ways):

  1. School Council: Different classes (math, art, history) run events, but the same student council handles the core rules and schedule.
  2. Orchestra: Many instruments exist, but a stable conductor group (committee) sets tempo and structure; soloists (peripheral experts) add flavor for specific pieces.
  3. Kitchen: A restaurant has many cooks, but a small head-chef team sets the base recipe and timing; specialist cooks add spices for a given cuisine.

Before vs After:

  • Before: We assumed separate expert teams for each domain, like different classrooms with totally different teachers.
  • After: We find a repeated, small committee that shows up almost everywhere to anchor reasoning and syntax, while other experts handle domain-specific facts.

Why it works (intuition, no equations):

  • Language has lots of shared structure (grammar, question patterns, logical steps). It’s efficient to centralize that into a small reusable core.
  • Sparse routing rewards strong, dependable experts: once a few experts get good at general reasoning, the router keeps choosing them.
  • Bigger expert pools don’t force diversity; they can increase centralization, because the router prefers reliable patterns that work across many inputs.

Building Blocks (explained in the best learning order, each with the Sandwich pattern; a small code sketch of the three scoreboard metrics follows this list):

  1. 🍞 Hook: Imagine a classroom where the teacher chooses a few helpers each time a question is asked, so not everyone needs to speak. đŸ„Ź The Concept (Mixture-of-Experts):

    • What it is: An AI model with many “experts,” where a router picks only a few (top-k) to answer each token.
    • How it works:
      1. Many experts wait in a pool.
      2. A router scores experts for each token.
      3. Only the top-k experts compute.
      4. The model mixes their answers.
    ‱ Why it matters: It saves compute and lets the model scale, like calling only the right helpers when needed.
    ‱ 🍞 Anchor: When the model sees “What is 2+2?”, it may pick math-leaning experts; for “Who painted the Mona Lisa?”, it may pick general reasoning plus art-knowledge experts.
  2. 🍞 Hook: Think of a small team of students who keep getting picked to run meetings for every club because they’re great at structure. đŸ„Ź The Concept (Standing Committee):

    • What it is: A small group of experts that get routed to across many domains, acting as a domain-invariant core.
    • How it works:
      1. Across tasks, record which experts the router picks most.
      2. Find the ones that keep showing up near the top.
      3. Confirm they appear across many domains and layers.
    ‱ Why it matters: Without this core, the model would waste time relearning structure for every domain; with it, reasoning and syntax stay steady.
    ‱ 🍞 Anchor: Whether the question is biology or law, tokens like “Which,” “Suppose,” and “?” often go to the same committee experts.
  3. 🍞 Hook: Imagine auditing a student council by checking attendance, roles, and consistency over the whole year. đŸ„Ź The Concept (COMMITTEEAUDIT):

    • What it is: A post-hoc framework that measures group-level routing patterns to identify stable expert committees.
    • How it works:
      1. Collect full routing weights for tokens across domains.
      2. Build domain-level profiles (who’s used and how much).
      3. Check that domains are well-separated clusters (so profiles are meaningful).
      4. Rank experts per domain, count cross-domain repeats, and select stable, high-ranking experts via a Pareto trade-off of mean rank and stability.
    ‱ Why it matters: Looking at groups (not single experts) reveals the hidden Standing Committee you’d miss by counting one expert at a time.
    ‱ 🍞 Anchor: After the audit, you can point to 3–5 experts that reliably carry 50–70% of the routing mass.
  4. 🍞 Hook: Like a scoreboard showing how many points each player actually scored, not just how often they touched the ball. đŸ„Ź The Concept (Expert Contribution Index, ECI):

    • What it is: The average routing weight an expert receives for a domain—a measure of its real contribution.
    • How it works:
      1. For each token, note the router’s weight for each expert.
      2. Average those weights over a domain’s data.
      3. Use the averages to compare experts.
    ‱ Why it matters: Frequency alone can mislead; ECI shows who truly carries load.
    ‱ 🍞 Anchor: If Expert A has higher ECI than Expert B in math, A is doing more of the math work overall.
  5. 🍞 Hook: To see how similar two friend groups are, you count how many friends they share. đŸ„Ź The Concept (Jaccard Similarity):

    • What it is: A number showing how much two sets overlap (shared experts Ă· total unique experts).
    • How it works:
      1. Pick the top-k experts for domain 1 and domain 2.
      2. Count how many are in both lists.
      3. Divide by how many are in either list.
    ‱ Why it matters: High Jaccard means domains reuse the same experts—a sign of a Standing Committee.
    ‱ 🍞 Anchor: If math and history share most of their top experts, Jaccard is close to 1.
  6. 🍞 Hook: Picture a pizza party: if a few kids get huge slices and others get crumbs, the split is unequal. đŸ„Ź The Concept (Gini Coefficient):

    • What it is: A measure of how uneven contributions are across experts.
    • How it works:
      1. Collect each expert’s ECI.
      2. Compare all pairs to see differences.
      3. Turn that into a single inequality score between 0 (even) and 1 (very uneven).
    ‱ Why it matters: A high Gini screams “few experts do most of the work,” supporting the Standing Committee idea.
    ‱ 🍞 Anchor: Seeing Gini > 0.9 means a tiny expert group dominates the model’s computation.
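
To tie the three scoreboards together, here is a small NumPy sketch of ECI, Jaccard similarity over top-k sets, and a Gini coefficient, following the plain-language definitions above. The synthetic two-domain data, the 8-expert setup, and the exact Gini formula are illustrative assumptions; the paper’s implementation may differ in details.

```python
import numpy as np

def expert_contribution_index(routing_weights):
    """ECI: average routing weight each expert receives over one domain's tokens.
    routing_weights: (n_tokens, n_experts) full router distributions."""
    return routing_weights.mean(axis=0)

def jaccard_top_k(eci_a, eci_b, k):
    """Overlap of two domains' top-k expert sets: |A intersect B| / |A union B|."""
    top_a = set(np.argsort(eci_a)[-k:])
    top_b = set(np.argsort(eci_b)[-k:])
    return len(top_a & top_b) / len(top_a | top_b)

def gini(values):
    """Inequality of contributions: 0 = perfectly even, toward 1 = a few experts dominate."""
    v = np.sort(np.asarray(values, dtype=float))  # ascending
    n = len(v)
    cum = np.cumsum(v)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

# Synthetic "domains": experts 0 and 1 form a shared core, the rest are fringe.
rng = np.random.default_rng(0)
math_routes = rng.dirichlet([10, 9, 1, 1, 1, 1, 1, 1], size=500)
law_routes  = rng.dirichlet([9, 10, 1, 2, 1, 1, 1, 1], size=500)

eci_math = expert_contribution_index(math_routes)
eci_law  = expert_contribution_index(law_routes)
print("Jaccard(top-2):", jaccard_top_k(eci_math, eci_law, k=2))  # high overlap -> same core experts
print("Gini(math ECI):", round(gini(eci_math), 3))               # well above 0 -> uneven load
```

On real routing data these are the same three numbers the paper reports (Jaccard ≈ 0.87, Gini > 0.9); the sketch just shows where those figures come from.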

03Methodology

At a high level: Input (MMLU questions) → Collect routing weights per token and layer → Build domain profiles → Check that profiles are meaningful → Rank and filter experts across domains → Select a Pareto-stable Standing Committee → Analyze coverage, size, and stability.

Step-by-step like a recipe, with what/why/examples (a compact code sketch of the whole pipeline follows these steps):

  1. Gather routing signals (full preferences, not just the winners)
  • What happens: For each MoE layer, the model’s router gives a probability-like weight to every expert for each token. Instead of only recording the top-k winners, we keep the full distribution (the whole “preference list”). We focus on the final token per sample to standardize comparison, but the method generalizes.
  • Why this step exists: Top-k alone throws away useful information about near-winners and the overall shape of preferences. Keeping the full vector lets us detect stable structures hidden beyond the winners.
  • Example: Suppose 8 experts exist. For a token, weights might look like [0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]. Even if top-2 are chosen (0.30 and 0.25), the rest still reveal a preference pattern.
  2. Build domain-conditioned routing profiles
  • What happens: For each domain (e.g., STEM–Math or Humanities), average each expert’s weights over all samples in that domain. This average is the Expert Contribution Index (ECI) per domain.
  • Why this step exists: We want a stable footprint of how much each expert helps in a domain, not just one-off decisions.
  • Example: Over all math questions, Expert #4 might average 0.18 weight, while Expert #7 averages 0.03—showing Expert #4 is a heavier math contributor.
  3. Check that domains form meaningful profiles
  • What happens: We compute a simple group-separation score (silhouette-style) to ensure that routing vectors within a domain are more similar to each other than to vectors from other domains. If a domain doesn’t show a coherent pattern, we don’t over-interpret it.
  • Why this step exists: If domain profiles are noisy and overlapping, committee discovery could be unreliable.
  • Example: If physics tokens’ routing vectors cluster tightly and far from law tokens’ vectors, physics is “well-structured.”
  4. Rank experts per domain and count cross-domain repeats
  • What happens: Within each domain, we rank experts by contribution (using ECI and top-k ranks). We then count how often each expert appears among the top-k across domains.
  • Why this step exists: Standing Committee members should appear frequently as top contributors across many domains, not just one.
  • Example: If Expert #12 is top-k in 8 out of 9 domains, that’s a strong committee candidate.
  5. Keep only consensus candidates
  • What happens: We set a high threshold (e.g., in the paper, appearing in ≄ 80% of domains) and keep experts that meet or exceed it.
  • Why this step exists: This filters out domain-only stars and keeps experts that are broadly useful.
  • Example: Expert #3 appears in top-k lists for math, physics, chemistry, biology, history, law, and CS—but not literature. That’s still ≄ 80%, so #3 survives.
  6. Measure stability: mean rank and variability
  • What happens: For each candidate, we compute its average rank across domains and how much that rank wiggles (variance). Lower variability means more consistent importance.
  • Why this step exists: A true committee member should be both strong (high average rank) and steady (low variability) across domains.
  • Example: Expert #5 averages rank 2.8 with tiny variance; Expert #9 averages rank 3.1 but with big swings. #5 is more stably valuable.
  7. Pick the Standing Committee via a Pareto trade-off
  • What happens: We select the Pareto-optimal set—experts that offer the best trade-offs between being highly ranked on average and being stable across domains.
  • Why this step exists: It avoids picking an expert that is incredible in one domain but unreliable overall, or super-stable yet too weak.
  • Example: Experts A, C, and D survive Pareto filtering; they’re both strong and steady.
  8. Quantify committee coverage, size, and persistence
  • What happens: We add up the committee’s contribution (coverage) and track committee size across layers and routing budgets (different top-k values). We also compute Jaccard similarity between domains’ top-k expert sets and Gini coefficients over contributions.
  • Why this step exists: To demonstrate that a small group truly dominates (high Gini), keeps reappearing across domains (high Jaccard), and accounts for a big chunk of the routing mass (high coverage), even as k or depth changes.
  ‱ Example: In OLMoE, a mid-layer committee of 2 experts covers roughly 30% of the routing mass; in Qwen3-30B-A3B, committees of 3–5 experts often cover 50–67%.
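
Pulling steps 4–8 together, below is a simplified Python sketch of the committee-selection logic. The 80% consensus threshold mirrors the example above, but the ranking rule, the Pareto test, and all names (`committee_audit`, `eci_by_domain`) are assumptions made for illustration rather than the paper’s exact implementation.

```python
import numpy as np

def committee_audit(eci_by_domain, k=8, consensus=0.8):
    """Select a Standing Committee from per-domain ECI vectors (simplified sketch).

    eci_by_domain: dict {domain_name: (n_experts,) ECI vector}
    Returns (committee_expert_indices, coverage_per_domain).
    """
    domains = list(eci_by_domain)
    n_experts = len(next(iter(eci_by_domain.values())))

    # Step 4: per-domain ranks (rank 1 = strongest) and top-k membership counts.
    ranks = np.zeros((len(domains), n_experts))
    top_k_hits = np.zeros(n_experts)
    for d, name in enumerate(domains):
        order = np.argsort(-eci_by_domain[name])          # strongest contributor first
        ranks[d, order] = np.arange(1, n_experts + 1)
        top_k_hits[order[:k]] += 1

    # Step 5: consensus filter -- keep experts that are top-k in >= 80% of domains.
    candidates = np.where(top_k_hits / len(domains) >= consensus)[0]

    # Step 6: stability -- mean rank (strength) and rank variance (steadiness).
    mean_rank = ranks[:, candidates].mean(axis=0)
    rank_var = ranks[:, candidates].var(axis=0)

    # Step 7: Pareto filter -- drop candidates dominated on both objectives.
    keep = []
    for i in range(len(candidates)):
        dominated = any(
            mean_rank[j] <= mean_rank[i] and rank_var[j] <= rank_var[i]
            and (mean_rank[j] < mean_rank[i] or rank_var[j] < rank_var[i])
            for j in range(len(candidates))
        )
        if not dominated:
            keep.append(candidates[i])
    committee = np.array(keep, dtype=int)

    # Step 8: coverage -- share of each domain's routing mass captured by the committee.
    coverage = {name: float(eci_by_domain[name][committee].sum() / eci_by_domain[name].sum())
                for name in domains}
    return committee, coverage

# Tiny synthetic run: nine fake domain profiles where experts 0-2 carry most of the mass.
rng = np.random.default_rng(2)
fake_profiles = {f"domain_{i}": rng.dirichlet([12, 11, 10, 2, 1, 1, 1, 1]) for i in range(9)}
committee, coverage = committee_audit(fake_profiles, k=3)
print("committee:", committee, "| mean coverage:", round(float(np.mean(list(coverage.values()))), 2))
```

With real MMLU routing profiles, this is the point in the pipeline where a 2–5 expert committee covering roughly 30–70% of the routing mass falls out.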

The secret sauce (what’s clever):

  • Group-first lens: Instead of asking “which single expert is hot?”, we ask “which small coalition keeps showing up together across many domains and layers?”
  • Full preference use: By retaining full routing distributions, we detect stable committee patterns beyond just the winners.
  • Pareto stability: Selecting experts that are both strong and steady avoids flukes and surfaces a truly domain-invariant core.

What would break without each step:

  • Without full routing vectors: You’d miss near-winners and underestimate the central core.
  • Without domain profiles: You couldn’t compare across tasks meaningfully.
  • Without stability checks: You might mistake noisy stars for reliable committee members.
  • Without inequality and overlap metrics (Gini, Jaccard): You couldn’t show centralization and cross-domain sharing clearly.

04Experiments & Results

The test (what and why): The authors evaluated three MoE LLMs—OLMoE (64 experts, k=8), DeepSeek-V2-Lite (64 experts, k=16, with shared experts), and Qwen3-30B-A3B (128 experts, k=8)—on the MMLU benchmark. They measured whether the same experts kept reappearing across nine broad domains and how unevenly contributions were distributed. This reveals whether specialization (different experts per domain) or centralization (the same committee reused) dominates.

The competition (what they compared against): The common assumption that MoEs specialize by domain, and prior analyses that focused on single experts or activation frequency. COMMITTEEAUDIT instead evaluates expert groups, cross-domain overlap, and contribution inequality.

The scoreboard (numbers with context):

  • Cross-domain sharing (Jaccard similarity of top-k sets):

    • OLMoE mean ≈ 0.8735 (Min ≈ 0.7963; Max 1.0)
    • DeepSeek-V2-Lite mean ≈ 0.8670 (Min ≈ 0.7103; Max 1.0)
    ‱ Qwen3-30B-A3B mean ≈ 0.8670 (Min ≈ 0.5300; Max 1.0)
    Interpreting this: Around 0.87 is like saying “most of the time, the same experts get picked across different subjects,” which is a strong sign of a Standing Committee.
  • Contribution inequality (Gini over ECI):

    • OLMoE overall ≈ 0.8957
    • DeepSeek-V2-Lite overall ≈ 0.9207
    ‱ Qwen3-30B-A3B overall ≈ 0.9465 (peaks up to ≈ 0.9605)
    Interpreting this: Numbers above 0.9 are extreme. It’s like in a class project where a tiny group does almost all the work.
  • Committee size and coverage (examples):

    • DeepSeek-V2-Lite: Mid-layer committee of 3 experts covers ≈ 60.7%, deep-layer 4 experts cover ≈ 70.5%.
    • OLMoE: Mid-layer 2 experts cover ≈ 29.7%; deep-layer 3 experts cover ≈ 44.0%.
    ‱ Qwen3-30B-A3B: Mid-layer 5 experts cover ≈ 67.0%; deep-layer 3 experts cover ≈ 50.9%.
    Interpreting this: A small group (2–5 experts) often covers the majority of the computation—like 3–5 students doing over half the project.

Surprising findings:

  • Bigger expert pools didn’t dilute the committee; they often increased centralization (Qwen had the highest Gini). That’s like adding more team members but the same few still do most work.
  • High overlap across layers: Jaccard stayed high (often ≄ 0.8), meaning the same experts appear repeatedly as inputs flow deeper, not just at one layer.
  • Robust but not rigid: Changing top-k (e.g., 4, 6, 8, 12, 16) showed the committee shape can shift outside its preferred setting (k=8 for OLMoE), but the centralization pattern persists.
  • Functional roles split: The Standing Committee consistently handles tokens tied to reasoning and syntax (e.g., “Which,” “Suppose,” “the,” “in”), while domain-specific terms spread across many peripheral experts.

Big picture: Across three models—with and without always-on shared experts—the core result holds. The router forms a domain-invariant Standing Committee that captures a large fraction of routing mass. This suggests centralization is not a quirk but an emergent property of sparse routing.

05Discussion & Limitations

Limitations (what this can’t do yet):

  • Architectural coverage: Only three MoE families were studied. We don’t know if hierarchical, hybrid, or dynamically adaptive routers show the same strength of committees.
  • Causality: This is an observational, inference-only audit. We didn’t “turn off” committee members to measure exact causal impact.
  • Task breadth: Most results rely on MMLU-style domains. Multi-step chats, coding, or tool-use may have different committee dynamics.
  • Training dynamics: The framework inspects trained models; it doesn’t yet track when and how committees form during training.

Required resources:

  • Access to MoE models and the ability to extract routing weights for many samples (GPU hours for inference and aggregation).
  • Basic analytics for ranking, overlap (Jaccard), inequality (Gini), and Pareto selection.

When not to use:

  • If your model doesn’t expose routing weights or uses fully dense computation, COMMITTEEAUDIT won’t apply as-is.
  • If domains are poorly defined or heavily mixed (e.g., noisy conversational data), domain profiles may be unstable and committees less clear.
  • If you must exactly balance expert usage for hardware reasons, the audit may suggest a training direction you cannot adopt.

Open questions:

  • Can we design training losses that embrace the core–periphery pattern instead of fighting it, improving both efficiency and accuracy?
  • How early do committees emerge during training, and can we speed up their formation?
  • Can we stabilize committee membership across routing budgets and prompts without harming flexibility?
  • Can targeted interventions (dropouts, swaps, or regularizers) strengthen the reasoning core while encouraging healthier, purposeful peripheral roles?

Honest takeaway: The evidence for Standing Committees is strong across varied MoEs, but we still need causal tests and broader benchmarks to confirm how universal and beneficial this structure is in practice.

06Conclusion & Future Work

Three-sentence summary: This paper shows that Mixture-of-Experts models naturally form a small, domain-invariant “Standing Committee” of experts that handles reasoning and syntax across many subjects. A new group-level auditing tool, COMMITTEEAUDIT, reveals high cross-domain expert overlap and extreme contribution inequality, meaning a few experts do most of the work. This challenges the usual assumption of domain specialization and suggests that load-balancing losses may fight the model’s natural optimization path.

Main achievement: Providing clear, multi-model evidence and a principled, post-hoc framework that surfaces a persistent core–periphery organization in MoEs.

Future directions:

  • Train-time methods that explicitly support a reasoning core plus purposeful periphery, instead of enforcing uniform usage.
  • Causal ablations to quantify committee necessity and discover minimal, robust cores.
  • Extending audits to conversations, tools, coding, and multimodal tasks, and to new routing architectures.
  • Monitoring committee formation during training to improve convergence and efficiency.

Why remember this: The big idea flips the usual story—MoEs don’t mainly win by perfect domain splits; they win by centralizing general reasoning in a small committee and sprinkling in domain facts from the edges. Designing with that truth in mind can save compute, improve stability, and make these models easier to understand and control.

Practical Applications

  ‱ Design new MoE training losses that support a strong reasoning core while encouraging purposeful, on-demand periphery roles.
  ‱ Prioritize pruning, quantization, or distillation efforts on the Standing Committee to reduce cost with minimal quality loss.
  ‱ Add monitoring dashboards that track committee coverage, Jaccard overlap, and Gini to catch routing regressions early.
  ‱ Target fine-tuning or instruction tuning primarily at committee members to improve general reasoning quickly.
  ‱ Allocate hardware and caching resources preferentially to committee experts for lower latency and better throughput.
  ‱ Perform causal ablations (masking or swapping committee experts) to validate necessity and improve robustness.
  ‱ Adjust top-k per layer based on committee stability to trade off compute vs. accuracy smartly.
  ‱ Route safety or policy checks through committee experts to stabilize behavior before peripheral details are added.
  ‱ During domain adaptation, train peripheral experts to capture new facts while keeping the committee steady.
  ‱ Use COMMITTEEAUDIT post-hoc on new MoE variants to guide architecture choices and hyperparameters.
#Mixture-of-Experts #Standing Committee #Sparse routing #Expert Contribution Index #Jaccard similarity #Gini coefficient #Core–periphery architecture #Load balancing loss #MoE interpretability #ECI-based ranking #Pareto selection #MMLU analysis #Expert centralization #Shared experts #Top-k routing