SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers
Key Summary
- This paper builds SocialVeil, a testing world where AI chat agents must talk to each other even when communication is messy, not perfect.
- It adds three realistic barriers: semantic vagueness (unclear words), sociocultural mismatch (different communication styles), and emotional interference (strong feelings make ideas fuzzy).
- SocialVeil measures more than task success by adding two new scores: Unresolved Confusion (how much confusion is left) and Mutual Understanding (how well both sides truly align).
- Across 720 role-play scenarios and four popular language models, barriers cut mutual understanding by over 45% and boost confusion by nearly 50%.
- Each barrier hurts in a different way: vagueness wrecks shared understanding, emotions damage relationships, and cultural mismatch keeps confusion hanging around.
- Human judges agreed strongly with the automatic scores (ICC about 0.78; Pearson r about 0.80), showing the tests are realistic and reliable.
- Simple advice like "ask clarifying questions" barely helps, while interactive learning makes steady but small gains (about 10–20%), still far from no-barrier performance.
- The study shows social reasoning (like building trust and understanding) is more fragile than just completing tasks when communication gets tough.
- SocialVeil is a step toward evaluating and training AI that can notice, handle, and fix real-world misunderstandings.
- This matters for daily life uses like help desks, tutoring, teamwork tools, and healthcare chats, where communication is often imperfect.
Why This Research Matters
In real life, conversations are messy: people hedge, come from different cultures, and feel strong emotions. SocialVeil helps ensure AI can still collaborate, support, and advise under those imperfect conditions. This matters for customer help desks, tutoring, healthcare triage, team collaboration tools, and more, where misunderstandings can be costly. By measuring confusion and mutual understanding, not just task completion, we prevent "paper-thin success" that leaves users baffled. The framework's realism and human-validated scoring make it a trustworthy way to diagnose and improve AI social skills. Over time, it can guide training that builds AI agents which notice, navigate, and repair misunderstandings, making them safer and more helpful in everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine trying to play a team sport while wearing foggy goggles. You can still move and kick, but it's much harder to pass well because you can't see your teammates clearly.
The Concept (Communication Barriers):
- What it is: Communication barriers are things that make it hard for people (or AIs) to understand each other.
- How it works (step by step):
- Someone sends a message.
- The message gets bent or blurred by something: unclear words, different cultural habits, or strong emotions.
- The receiver guesses the meaning based on what they heard and their own background.
- If the guess is off, confusion builds unless both sides repair the misunderstanding.
- Why it matters: Without handling barriers, conversations look smooth on the surface but break inside, leading to wrong actions, hurt feelings, or unfinished goals. Anchor: Two students text about a project. One says, "Let's handle it soon." The other thinks "soon" means today, but the first meant next week. The project slips.
The world before: For years, AI chat systems were tested like they were talking in a quiet library with perfect lighting: no misunderstandings, no hidden assumptions, no strong feelings. Benchmarks asked: Can the AI answer questions, role-play nicely, and reach goals? Many models got very good at these tests. But real conversations are rarely perfect. People hedge ("maybe, kinda"), come from different cultures ("we'll think about it" can mean "no" in some places), and carry emotions ("I'm too upset to explain").
The problem: Existing social benchmarks often assume both agents share the same language habits, norms, and calm tone. That hides what really matters in the wild: Can an AI notice trouble, fix confusion, and keep the relationship healthy while still getting things done?
Failed attempts: Some projects injected random noise (like word scrambling) to make conversations harder. But this rarely felt realistic; the mess looked artificial, not like real human miscommunications. Others used totally free-form prompts to create chaos, but that lost control and made runs hard to compare.
The gap: We needed a way to add realistic, literature-backed communication barriers that are structured (so tests are fair and repeatable) and to measure outcomes that go beyond just finishing a task. We also needed metrics that check the health of understanding between partners, not only whether someone bulldozed to a result.
Hook: You know how sometimes your friend says "you know what I mean," and you don't, but you nod anyway?
The Concept (Semantic Vagueness):
- What it is: Semantic vagueness is when words are too fuzzy or empty to pin down a clear meaning (like saying "it," "that," or "soon" without context).
- How it works:
- Speaker uses vague words or placeholders.
- Listener fills in the blanks using guesses.
- If guesses differ, confusion grows, even if both are polite.
- Why it matters: Without clarity, you can't build a shared map of what's going on, so plans fall apart. Anchor: "Let's meet there later to fix it." Where's "there"? What's "it"? No wonder nothing gets fixed.
Hook: Have you noticed how one culture values a direct "no," while another might say, "We'll think about it," and mean the same thing?
The Concept (Sociocultural Mismatch):
- What it is: People from different cultures can misread each other's style (directness, politeness, context), causing mismatched interpretations.
- How it works:
- Speaker uses a culturally normal signal (e.g., indirect refusal).
- Listener interprets it through their own cultural lens (e.g., thinks it's a delay, not a no).
- Expectations drift apart, and confusion sticks.
- Why it matters: Mismatched styles can quietly derail teamwork and trust even when everyone has good intentions. Anchor: A manager says, "We'll consider it," meaning "no." The employee keeps preparing a proposal, thinking it's a "maybe." Time and energy are wasted.
Hook: Imagine trying to solve a puzzle while you're really upset. You can barely focus on the pieces.
The Concept (Emotional Interference):
- What it is: Strong feelings (anger, anxiety, frustration) push aside clear thinking, making it hard to give or receive information.
- How it works:
- Emotion floods attention.
- Details get dropped or distorted.
- Messages become more about feelings than facts.
- Why it matters: Even simple tasks can stall if feelings swamp the signal. Anchor: "I'm too mad to explain; just figure it out!" The listener can't fix what they can't understand.
Real stakes: Think of customer support, doctor-patient chats, classroom help, or workplace negotiations. If an AI can't spot and repair misunderstandings, it may give wrong advice, frustrate users, or damage relationships. The paper's big idea is to pull testing out of the quiet library and into the noisy hallway of real human talk, then see which AI agents can still play well with others.
02 Core Idea
Hook: You know how a good driving test doesn't only check parking on an empty road, but also tests you during rain, at night, and around sudden detours?
The Concept (SocialVeil's Aha!):
- What it is: SocialVeil is a testing world that adds realistic communication barriers on purpose to see if AI agents can still understand, repair, and relate.
- How it works:
- Build role-play scenarios with hidden goals (so each agent has private aims).
- Inject one barrier (vagueness, cultural mismatch, or emotional interference) into just one agent.
- Let both agents talk for multiple turns.
- Score not only task success but also confusion left over and mutual understanding.
- Why it matters: If we don't test under messy conditions, we can't trust AI to work well in the real world. Anchor: Two agents negotiate room chores. One speaks vaguely ("Let's handle that thing later"), and SocialVeil measures whether the other can clarify and align.
One-sentence key insight: To truly measure social intelligence, we must test AI in conversations where meaning is hard to build, and check if they can notice and fix the cracks.
Three analogies:
- Sports: Don't judge a soccer team only on practice drills; judge them under rain, noise, and a tough opponent.
- Cooking: A chef isn't just good in a perfect kitchen, but also when an ingredient runs out and they must adapt.
- Maps: It's easy to follow a route when signs are clear; true skill is navigating when signs are faded or mismatched.
Before vs. After:
- Before: Benchmarks assumed smooth talk, shared norms, and calm moods. Success = do the task.
- After: SocialVeil adds structured barriers and checks whether agents can keep the conversation healthy, reduce confusion, and align intentions, measuring social success beyond just finishing.
Why it works (the intuition):
- Barriers aren't random noise; they are patterned ways real talk breaks (vague words, mismatched styles, intense feelings). By injecting one barrier into one agent, we get controlled, reproducible stress that still feels real. Measuring both goals and the health of understanding reveals where social reasoning fails even when tasks sometimes still get done.
Building blocks:
- Barrier taxonomy: Three cognitive barrier types drawn from social science research.
- Two-layer barrier design: A high-level style prompt (like "overuse pronouns") plus numeric constraints (how often, which tactics) for reproducibility.
- Asymmetric setup: Only one agent is barrier-affected; the partner stays normal. This mirrors real life and keeps tests controlled.
- Barrier-aware metrics: Add Unresolved Confusion and Mutual Understanding to standard scores (goal, relationship, knowledge, believability).
- Human validation: People agree the barriers feel real and the scores make sense.
Hook: Ever tried to solve a problem with a teammate who keeps changing how they talk?
The Concept (Barrier-aware Evaluation Metrics):
- What it is: New scores that check how well a conversation heals confusion and reaches shared understanding, not just whether a task ends.
- How it works:
- After the chat, rate Unresolved Confusion from 1 (a mess) to 5 (fully clear).
- Rate Mutual Understanding from 1 (talking past each other) to 5 (fully aligned).
- Compare alongside traditional scores: goal, relationship, knowledge, believability.
- Why it matters: Without these, an agent could "win" the task while leaving the partnership confused or damaged. Anchor: A customer gets a refund (task done) but still doesn't understand what went wrong. Confusion remains high; mutual understanding is low.
Hook: Picture a science fair where multiple robots collaborate at different tables.
The Concept (Multi-agent Systems):
- What it is: Multi-agent systems are teams of AI agents that talk and work together.
- How it works:
- Each agent has private goals.
- They exchange messages and adapt to each other.
- Group performance depends on communication quality.
- Why it matters: Real applications often need teamwork (support triage, tutoring groups, co-pilots), so we must test conversations between agents, not just solo skills. Anchor: Two AI assistants plan an event together: one handles budget, the other scheduling. Team success depends on their messages making sense to each other.
03 Methodology
High-level recipe: Input (role-play scenario with private goals) → Barrier Injection (one agent gets a structured barrier style) → Interaction Simulation (multi-turn conversation) → Evaluation (task scores + barrier-aware scores) → Optional Adaptation (repair instruction or interactive learning) → Re-evaluation.
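To make the recipe's shape concrete, here is a minimal runnable Python sketch. Every name in it (Scenario, inject_barrier, simulate, evaluate) is a hypothetical stand-in for illustration, not the authors' actual code.

```python
"""Minimal runnable sketch of a SocialVeil-style pipeline.
All names here are hypothetical stand-ins, not the paper's code."""
from dataclasses import dataclass

BARRIERS = ("semantic_vagueness", "sociocultural_mismatch", "emotional_interference")

@dataclass
class Scenario:
    setting: str   # shared public context, already neutralized (Step 1)
    goal_a: str    # private goal of agent A (the barrier agent)
    goal_b: str    # private goal of agent B (the normal partner)

def inject_barrier(system_prompt: str, barrier: str) -> str:
    # Placeholder for the two-layer design of Step 3 (style + parameters).
    return f"{system_prompt}\n[BARRIER STYLE: {barrier}]"

def simulate(prompt_a: str, prompt_b: str, max_turns: int = 20) -> list:
    # Placeholder: each turn would call an LLM on the chat history (Step 4).
    return [f"utterance {t}" for t in range(max_turns)]

def evaluate(transcript: list) -> dict:
    # Placeholder: an evaluator model assigns task + barrier-aware scores (Step 5).
    return {"goal": 0.0, "relationship": 0.0, "knowledge": 0.0, "believability": 0.0,
            "unresolved_confusion": 0.0, "mutual_understanding": 0.0}

scenario = Scenario("Two roommates review monthly expenses",
                    goal_a="get chores redistributed", goal_b="keep the budget flat")
prompt_a = inject_barrier(f"Role A. Private goal: {scenario.goal_a}", BARRIERS[0])
prompt_b = f"Role B. Private goal: {scenario.goal_b}"  # partner stays barrier-free
scores = evaluate(simulate(prompt_a, prompt_b))
```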
Step 1: Scenario setup and neutralization
- What happens: The authors adapt 720 role-play episodes from a prior benchmark (SOTOPIA). Each episode gives two agents a shared public setting but private goals. A neutralization step rewrites scenario descriptions to remove hints that might leak a partner's goal.
- Why it exists: If hints leak, the conversation becomes too easy and unrealistically tidy.
- Example: Original might say "Alex must convince Jamie to lower rent." Neutralized: "Two roommates review monthly expenses," hiding the specific aim.
Step 2: Asymmetric barrier injection
- What happens: Only one agent (the barrier agent) gets a special two-layer prompt that forces a specific disruption style. The partner agent stays normal.
- Why it exists: Asymmetry mirrors real life (one person might be vague or upset, the other not), and it keeps experiments controlled: you know which side the disruption comes from.
- Example: The barrier agent for semantic vagueness overuses pronouns and ellipses ("it... that... you know"), makes indirect references, and withholds confirmations.
Step 3: Two-layer barrier design
- What happens: Each barrier has (a) a style prompt (high-level directive) and (b) parameterization that controls four dimensions: narrative stance, interaction tactics, confusion mechanisms, and exemplar templates. This creates realistic yet repeatable behavior.
- Why it exists: Purely free-form prompts can drift, while simple noise feels fake. The two-layer plan balances realism with consistency.
- Example: Emotional interference sets an emotion-focused stance, asks for more self-focus pronouns ("I," "my"), pushes stronger sentiment, and limits precise task details.
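A two-layer spec like this can be captured in a small data structure. The four dimension names follow the description above; the style text and field values below are invented for illustration, not taken from the paper's prompts.

```python
"""Illustrative two-layer barrier spec: a high-level style prompt (layer 1)
plus structured parameters (layer 2). Concrete values are invented."""
from dataclasses import dataclass

@dataclass
class BarrierSpec:
    style_prompt: str           # layer 1: high-level directive
    narrative_stance: str       # layer 2a: how the agent frames itself
    interaction_tactics: list   # layer 2b: recurring conversational moves
    confusion_mechanisms: list  # layer 2c: how confusion is produced
    exemplar_templates: list    # layer 2d: sample phrasings

SEMANTIC_VAGUENESS = BarrierSpec(
    style_prompt="Speak vaguely: overuse pronouns and ellipses, avoid confirming specifics.",
    narrative_stance="noncommittal",
    interaction_tactics=["indirect reference", "withhold confirmation"],
    confusion_mechanisms=["unresolved 'it'/'that'", "deferred specifics"],
    exemplar_templates=["Let's handle that thing later...", "It... you know."],
)

def to_system_prompt(spec: BarrierSpec) -> str:
    # Flatten both layers into one reproducible system-prompt block.
    return (f"{spec.style_prompt}\nStance: {spec.narrative_stance}\n"
            f"Tactics: {', '.join(spec.interaction_tactics)}\n"
            f"Examples: {' | '.join(spec.exemplar_templates)}")
```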
Step 4: Multi-turn interaction
- What happens: Agents take turns (up to 20 turns). Each generates its utterance based on the chat history and its private goal. Only the barrier agent follows the barrier style.
- Why it exists: Real misunderstandings need back-and-forth chances to grow, be noticed, and (ideally) be repaired.
- Example: The partner tries to clarify, "Do you mean the cleaning schedule or the budget?" The vague agent replies, "That thing... we can sort it out later."
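Mechanically, the interaction is a simple alternating loop. This sketch assumes a generic chat-completion callable (the chat stub below) and is not the authors' implementation:

```python
def chat(system_prompt: str, history: list) -> str:
    # Stand-in for an LLM call (e.g., any chat-completion API) that sees
    # the shared history plus this agent's private system prompt.
    return "..."

def simulate(barrier_prompt: str, partner_prompt: str, max_turns: int = 20) -> list:
    history = []
    prompts = (barrier_prompt, partner_prompt)  # index 0 = barrier agent
    for turn in range(max_turns):
        speaker = turn % 2                      # agents strictly alternate
        utterance = chat(prompts[speaker], history)
        history.append({"speaker": speaker, "text": utterance})
    # Note the asymmetry: both agents see the same history, but only
    # prompts[0] carries the barrier style.
    return history
```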
Step 5: Evaluation protocol
- What happens: Measure the classic social scores of Goal Completion (0–10), Relationship quality (−5 to +5), Knowledge (0–10), and Believability (0–10), then add two barrier-aware scores: Unresolved Confusion (1–5; higher is clearer) and Mutual Understanding (1–5; higher is more aligned).
- Why it exists: Tasks can be finished while understanding and trust crumble. These extra scores shine a light on the invisible cracks.
- Example: A chat that reaches a plan but leaves key details fuzzy gets a decent goal score but low mutual understanding and confusion scores.
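As a minimal sketch, the score sheet for one episode might look like the record below. The ranges mirror the ones listed above; the example values are invented to match the fuzzy-plan case just described.

```python
"""Sketch of the metric schema for one episode. Parsing the evaluator
model's output into this record is left abstract."""
from dataclasses import dataclass

@dataclass
class EpisodeScores:
    goal: float                  # 0-10, task completion
    relationship: float          # -5 to +5, relationship quality
    knowledge: float             # 0-10
    believability: float         # 0-10
    unresolved_confusion: float  # 1-5, higher = clearer
    mutual_understanding: float  # 1-5, higher = more aligned

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

# Invented example: a plan is reached, but key details stay fuzzy.
fuzzy_plan = EpisodeScores(goal=6.0, relationship=1.0, knowledge=4.0,
                           believability=8.0,
                           unresolved_confusion=clamp(2.0, 1, 5),
                           mutual_understanding=clamp(2.0, 1, 5))
```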
Step 6: Validation with humans and representation probes
- What happens: Human judges rate barrier type, confusion, and understanding; their ratings agree strongly with automatic scores. The authors also checked hidden model states and found distinct clusters for each barrier, suggesting barriers produce structured, not random, shifts.
- Why it exists: To show the barriers feel real to people and are encoded systematically in the modelâs internal signals.
- Example: Humans could correctly identify barrier types above chance (~68% overall), and their scores correlated with the evaluator (Pearson r about 0.80).
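The correlation half of that check is easy to reproduce in outline with SciPy. The paired ratings below are invented toy data; only the method (Pearson correlation between human and automatic scores) is the point.

```python
"""Toy agreement check between human and automatic ratings."""
from scipy.stats import pearsonr

human_confusion = [2, 4, 3, 5, 1, 2, 4, 3]  # human 1-5 ratings per episode
model_confusion = [2, 5, 3, 4, 1, 2, 4, 2]  # automatic evaluator's ratings

r, p_value = pearsonr(human_confusion, model_confusion)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # the paper reports r ≈ 0.80
```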
Hook: When you break a vase, superglue instructions alone don't always fix it; you need practice handling real cracks.
The Concept (Repair Instruction):
- What it is: A set of guiding tips added to the partner agent's prompt, like "ask clarifying questions and paraphrase to confirm."
- How it works:
- Give explicit advice in the meta-prompt.
- Agent tries to follow it during the chat.
- Hopes to reduce misunderstandings.
- Why it matters: It's a quick fix to encourage better habits, but might be too shallow for deep, shifting problems. Anchor: The agent keeps saying, "Just to confirm, you mean the budget?" But if the other person stays vague or emotional, the chat still stalls.
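As a sketch, repair instruction amounts to static prompt augmentation: the same advice is bolted on no matter which barrier is active, which helps explain the small gains reported later. The tip wording below is paraphrased for illustration.

```python
"""Repair instruction as static meta-prompt augmentation (illustrative)."""

REPAIR_TIPS = (
    "If the other speaker is vague, ask one targeted clarifying question. "
    "Paraphrase their request and ask them to confirm before acting. "
    "If they seem upset, acknowledge the feeling before returning to the task."
)

def with_repair_instruction(partner_prompt: str) -> str:
    # The partner gets the same fixed tips regardless of which barrier
    # is active or how severe it is.
    return f"{partner_prompt}\n\nGuidelines: {REPAIR_TIPS}"
```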
Hook: You learn to ride a bike not by reading, but by wobbling, adjusting, and trying again.
The Concept (Interactive Learning):
- What it is: Training agents through example conversations and their own trial-and-error to gradually learn barrier-handling strategies.
- How it works:
- Behavior Cloning: Gather strong demonstration chats (using a strong partner and an evaluation filter) and train the agent to imitate good moves.
- Self-Reinforcement: Let the trained agent practice with the barrier agent, rate outcomes, keep the good parts, and repeat.
- Why it matters: Real-time practice helps agents learn nuanced repairs beyond canned advice. Anchor: After many chats with a culturally indirect partner, the agent learns to gently rephrase and confirm intent without sounding pushy.
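Schematically, the two stages form a filter-and-retrain loop. Everything in this sketch (the stub functions, the keep-if-scored-4-or-higher filter, the batch sizes) is an invented stand-in for the paper's training pipeline.

```python
"""Runnable schematic of behavior cloning followed by self-reinforcement.
All functions are stubs; thresholds and sizes are invented."""
import random

def collect_demo():   # stub: one demonstration chat from a strong partner
    return {"mutual_understanding": random.uniform(1, 5)}

def self_play():      # stub: the trained agent practices with the barrier agent
    return {"mutual_understanding": random.uniform(1, 5)}

def fine_tune(dataset):  # stub: gradient update on the kept episodes
    print(f"fine-tuning on {len(dataset)} episodes")

# Stage 1: behavior cloning on evaluator-filtered demonstrations.
demos = [collect_demo() for _ in range(100)]
fine_tune([d for d in demos if d["mutual_understanding"] >= 4])

# Stage 2: self-reinforcement - practice, score, keep the good parts, repeat.
for round_ in range(3):
    episodes = [self_play() for _ in range(64)]
    fine_tune([e for e in episodes if e["mutual_understanding"] >= 4])
```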
Secret sauce highlights:
- Unilateral barriers: Only one side is disrupted, keeping cause-and-effect clear.
- Two-layer prompts: Combine realism with reproducibility.
- Barrier-aware metrics: Don't let task success hide social fractures.
- Human alignment: People confirm that the barriers and scores match real conversation patterns.
Concrete data example:
- Suppose in a cultural mismatch case, an agent says, "We'll think about it," meaning no. The partner continues planning as if it's a yes. SocialVeil's evaluator gives: Goal moderate (some planning happened), Relationship okay (no overt conflict), Knowledge limited (key decision unclear), but Confusion high and Mutual Understanding low. This pattern matches the intended barrier.
04 Experiments & Results
The test: The authors ran 720 episodes: 180 with no barrier and 180 each for semantic vagueness, sociocultural mismatch, and emotional interference. Four partner models were tested. Each chat ran up to 20 turns, then was scored on task and social metrics, plus the two barrier-aware metrics.
The competition: The same models were also evaluated in barrier-free cases for comparison. The key question: How much do barriers hurt? And do strategies like repair instruction or interactive learning help close the gap?
The scoreboard with context:
- Big picture: Barriers hurt across the board. Mutual understanding fell by over 45% on average, and confusion jumped by nearly 50% across scenarios.
- Distinct fingerprints:
  - Semantic vagueness: Worst hit to mutual understanding (about −58% on average). Without clear references, partners couldn't align.
  - Emotional interference: Heaviest damage to relationships (about −49% on average). Feelings got in the way of connection.
  - Sociocultural mismatch: Persistent confusion (about −49% on average), even when other parts stayed less harmed.
- Fragility of social reasoning: Goal completion and knowledge dropped moderately (about 20–30%), but social dimensions like relationship (≈ −45%) and mutual understanding (≈ −52%) dropped much more. So, it's easier to "do" than to "relate" under stress.
Human validation and hidden-state probing:
- Humans vs. automatic evaluator: Strong agreement, with ICC around 0.77–0.79 and Pearson correlations ≈ 0.80 for both confusion and mutual understanding. Judges could identify which barrier was used above chance (~68% overall accuracy).
- Internal representations: The model's hidden states formed tight, separate clusters for each barrier in t-SNE space, suggesting barriers introduced structured, not random, shifts.
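That probe can be reproduced in outline with scikit-learn's t-SNE. The random features below stand in for real per-episode hidden states, with an artificial mean shift per condition so the clustering idea is visible; only the project-and-inspect recipe is the point.

```python
"""Toy version of the hidden-state probe: project per-episode states
with t-SNE and look for per-barrier clusters."""
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in: 4 conditions x 50 episodes x 768-dim hidden states, with a
# different mean shift per condition so clusters are visible.
states = np.concatenate([rng.normal(loc=i, scale=1.0, size=(50, 768))
                         for i in range(4)])
labels = np.repeat(["none", "vagueness", "cultural", "emotional"], 50)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(states)
# Plot `coords` colored by `labels`; well-separated clusters suggest
# structured, not random, shifts under each barrier.
```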
Adaptation strategies:
- Repair instruction: Minor, often negligible gains. Why? Shallow advice doesnât detect which barrier is happening or how severe it is.
- Interactive learning: Consistent but modest improvements (about 10–20%). Helpful, but still far below no-barrier performance.
- Trade-off hint: These adaptations didn't boost goal scores much, suggesting the agent's effort is spent on damage control rather than pushing task objectives.
Surprising findings:
- You can "finish a task" while still misunderstanding your partner. Without barrier-aware metrics, you might miss the hidden cost.
- Simple "be clearer" prompting doesn't cut it. Real repair needs recognizing which barrier is active and responding with the right move at the right time.
- Tone matters: Positive sentiment correlated with smoother interactions and better goals, while overuse of vague or self-focused language correlated with more confusion and worse alignment.
05 Discussion & Limitations
Limitations:
- Text-only: Real conversations use voice, facial expressions, and gestures. Barriers like sarcasm or emotional tone often travel through non-verbal cues.
- Short episodes: Many misunderstandings build up or get repaired over longer relationships; here, chats are one-off.
- Post-hoc focus: The main scores are about how the conversation ended, not whether agents proactively prevented barriers.
- Narrow barrier set: Three core types are covered; real life has mixes and other forms (like attention lapses or power dynamics) that could be added.
Required resources:
- A capable evaluator model to score chats consistently (e.g., GPT-class models) and compute barrier-aware metrics.
- Access to open-weight or proprietary LLMs to act as agents; moderate GPU for fine-tuning if trying interactive learning.
- Scenario bank and prompts for stable, reproducible barrier injection.
When not to use:
- If you only need single-turn facts (like "What's the capital of France?"), these social tests may be overkill.
- If the application forbids any ambiguity (e.g., critical safety commands), you may prefer hard constraints over repair-oriented training.
- If multimodal signals (voice tone, gaze) are central, text-only barriers could miss key dynamics.
Open questions:
- Detection: How can agents reliably detect which barrier is happening in real time?
- Strategy selection: Can we learn a "repair policy" that chooses the right move (clarify, reframe, empathize) based on the barrier type and severity?
- Long-term learning: Do agents improve rapport over weeks or months, remembering past breakdowns and adjusting earlier?
- Multimodal extension: How do we inject and measure barriers across voice and vision channels?
- Fairness: Do repair strategies work equally well across cultures and languages, or do they need localization?
06 Conclusion & Future Work
Three-sentence summary: SocialVeil is a new environment that tests AI chat agents under realistic communication barriers, not just perfect talk. It adds three barrier types (vague words, cultural mismatch, and strong emotions) and measures not only tasks but also confusion and mutual understanding. Results show big drops in social reasoning under barriers, modest gains from interactive learning, and strong human alignment with the evaluation.
Main achievement: Turning "messy, real conversation problems" into a controlled, repeatable, literature-backed testbed with barrier-aware metrics that expose hidden weaknesses in AI social intelligence.
Future directions:
- Add multimodal barriers (voice tone, facial cues) and longer, evolving relationships.
- Teach agents to detect barrier types on the fly and pick tailored repair strategies.
- Use SocialVeil not just for testing but as a training ground for socially robust agents that improve over time.
Why remember this: Great AI isn't only about giving correct answers; it's about building shared meaning with others when the road is bumpy. SocialVeil shows how to test, and eventually train, agents to notice, navigate, and mend misunderstandings so conversations succeed both in outcome and in understanding.
Practical Applications
- Customer support bots that detect vagueness and ask targeted clarifying questions before offering solutions.
- Healthcare intake assistants that gently manage emotional interference and confirm mutual understanding of symptoms and next steps.
- Educational tutors that adapt to students' cultural communication styles and paraphrase instructions to reduce confusion.
- Workplace copilots that sense misalignment in team chats and summarize agreed points to restore shared context.
- Negotiation assistants that identify indirect refusals and reframe options without escalating conflict.
- Onboarding chatbots that spot unresolved questions and proactively propose clarifying checklists.
- Accessibility tools that flag ambiguous phrases ("that," "it," "soon") and suggest clearer rewrites.
- Moderation helpers that recognize emotionally heated turns and insert cooling, empathic prompts to keep discussions productive.
- Agent teams (multi-agent systems) that self-check alignment after each turn and trigger repair steps when understanding drops.
- Training pipelines that use SocialVeil-style episodes to fine-tune agents for barrier detection and targeted repair strategies.