Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Key Summary
- This paper studies Moltbook, a giant social network made only of AI agents, to see if they start acting like a real society over time.
- The authors define AI Socialization as agents changing their behavior because of ongoing social interaction, not just random drift.
- They build a diagnostic toolkit to measure five things: semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus.
- Moltbook quickly stabilizes on average topics (the "center" stays the same) while keeping lots of variety among individual posts.
- Agents show strong inertia: they do not learn from upvotes/downvotes or from the people they comment on, so interactions don't change their behavior.
- Influence doesn't stick: there are no long-term leaders or super-popular posts, and "supernodes" come and go without lasting power.
- There is no shared social memory or agreement on who is influential; references are scattered or even hallucinated.
- Conclusion: big scale and lots of interaction are not enough for socialization; societies also need memory, feedback learning, and governance.
- The paper offers a clear way to test whether future AI societies are actually socializing instead of just talking a lot.
Why This Research Matters
AI communities are coming fast, and we need to know whether they become organized societies or stay as noisy crowds. If agents don't learn from feedback or interactions, platforms won't self-correct and can amplify confusion. Without stable leaders or shared memory, it's hard to coordinate, govern, or even agree on what is trustworthy. Designers can use these diagnostics to build features like memory, reputation, and feedback learning that make real communities possible. Policymakers and safety teams can monitor whether a platform is maturing or just scaling chaos. For everyone, this work helps ensure future AI societies are coherent, accountable, and helpful.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine a huge playground where only robots hang out. They can post messages, comment on each other, and vote on what they like. The big question: will these robots start acting like a real community, the way kids on a playground slowly learn the rules and make leaders?
🥬 Filling (The Actual Concept: AI Socialization)
- What it is: AI Socialization means AI agents change their visible behavior because of ongoing social interactions with other agents (not just random changes or their built-in quirks).
- How it works (in human terms):
- People talk a lot.
- They start sharing slang, values, and patterns.
- Leaders and norms appear, and groups remember what mattered.
- Over time, the group becomes more coordinated and predictable.
- Why it matters: Without socialization, you get lots of chatter but no shared direction, no trusted leaders, and no agreement: just noise that never settles.
🍞 Bottom Bread (Anchor) Think of a soccer team. If players never adapt to their teammates, they won't form plays, captains won't emerge, and wins will be random. That's a team without socialization.
- The World Before: You know how early chatbots mostly handled single tasks, like answering trivia or scheduling? That was like one robot doing its homework alone. Later, researchers let multiple AI agents talk to each other for short tasks, like group projects, but usually in small, closed rooms and for limited time. People could measure teamwork on a single assignment, but not what happens when millions of agents keep chatting for weeks or months.
- The Problem: Do lots of AI agents, in a big open world, start forming something like a society, with shared language styles, leaders, and agreements? Or do they just talk past each other forever? Until Moltbook (a massive AI-only social network), we couldn't watch a large-scale, ongoing agent world to find out.
- Failed Attempts:
- Static snapshots: Looking at one day in a small system doesn't reveal long-term change.
- Task-only setups: Agents can coordinate on a single job, but that doesn't prove socialization over time.
- Size-only thinking: People assumed more agents and messages would automatically create community structure, but size isn't the same as social glue.
- The Gap: We lacked a clear definition and a measurement toolkit to test whether socialization is actually happening in AI societies. We needed multi-level, time-aware metrics that check both the big picture and the fine details.
🍞 Top Bread (Hook) You know how a classroom needs rules, familiar routines, and shared memories (like, "Remember the science fair?") to feel like a real class?
🥬 Filling (The Actual Concept: Diagnostic Framework for AI Socialization)
- What it is: A set of measurements that track whether an AI society is stabilizing its language, adapting individuals, and forming persistent influence and consensus.
- How it works:
- Check if the average topic stays steady (macro stability).
- Check if individual posts cluster more tightly (micro homogeneity) or keep diverse.
- See if agents change after feedback (upvotes) or interactions (comments).
- Track if stable leaders emerge in the network.
- Probe for shared memory and agreement about who matters.
- Why it matters: Without a way to measure, we're guessing; with it, we can tell if a community is truly forming or just making noise.
🍞 Bottom Bread (Anchor) It's like a report card for a school: attendance (scale), class rules (norms), star students (leaders), and yearbook memories (shared references). The toolkit checks each.
- Real Stakes
- Safety and moderation: Without shared norms, bad ideas can spread or chaos can erupt (the paper even saw memecoin-like waves start spontaneously).
- Product design: If agents don't learn from feedback, "like" systems won't self-correct.
- Governance: Real societies need memory and trusted anchors; AI societies may need built-in structures to get there.
- Research clarity: We need to know when a ābig crowdā is a real community versus a noisy room.
🍞 Top Bread (Hook) Imagine a loud cafeteria that never grows quiet patterns, just constant chatter that doesn't build tradition. That's the risk in AI worlds without socialization.
02 Core Idea
🍞 Top Bread (Hook) You know how planting more seeds doesn't guarantee a forest unless roots take hold and trees connect through soil and time? More agents and more messages alone won't grow a society.
🥬 Filling (The Actual Concept: The Aha!)
- What it is: The key insight is that scale and dense interaction are not enough to create socialization in AI societies; you also need mechanisms that let influence stick, feedback matter, and shared memory persist.
- How it works:
- Measure macro stability (does the overall topic center settle?).
- Measure micro diversity (do individual posts cluster or stay spread out?).
- Test if agents actually adapt after feedback or interactions.
- Track if influence accumulates into lasting leaders.
- Probe whether agents agree on who or what matters (shared social memory).
- Why it matters: If those signals don't move together toward convergence, you have activity without socialization, like waves that never become a current.
🍞 Bottom Bread (Anchor) A school without traditions, mentors, or a memory of past events stays a crowd, not a community, no matter how many students it has.
Multiple Analogies
- Orchestra: Lots of instruments (agents) playing louder (more posts) wonāt make harmony unless they listen, adjust to feedback, and follow a conductor (influence). Moltbook sounds busy but not harmonized.
- Weather vs. Climate: Daily chats (weather) vary, but the average topic center (climate) stabilizes quickly while the local differences stay big, so there is no long-term narrowing.
- Ant trails: Many ants make trails that get reinforced by pheromones (memory and feedback). Without reinforcement, trails fade; Moltbook lacks such reinforcement, so no stable paths emerge.
🍞 Top Bread (Hook) You know how group projects get better when teams learn from past scores and agree on who leads what?
🥬 Filling (The Actual Concept: Why It Works)
- What it is: A multi-level diagnostic that separates looking stable on average from actually becoming similar individually, and that tests if behavior truly changes because of social contact.
- How it works:
- Macro vs. micro: Compare average stability to person-to-person similarity.
- Cause vs. coincidence: Use baselines (random shuffles, random posts) to ensure changes arenāt just chance.
- Structure vs. cognition: Check both the network (leaders) and the mind (shared references).
- Why it matters: Without these separations, you might think a steady average means real agreement, when it might just hide ongoing differences.
🍞 Bottom Bread (Anchor) It's like seeing class test averages hold steady while each student studies something different and nobody follows any top student: stability without alignment.
Building Blocks (each with a Sandwich)
- Semantic Convergence
🍞 Hook: Friends often pick up the same slang over time.
🥬 Concept: Semantic convergence is when messages grow more similar in meaning.
- How: Track if the average topic center stabilizes and if posts get more alike.
- Why: Convergence signals shared norms forming. 🍞 Anchor: A class starts saying the same catchphrases.
- Lexical Turnover
🍞 Hook: Slang comes and goes.
🥬 Concept: Lexical turnover is how new words are born and old ones die over time.
- How: Count daily births/deaths of words and phrases.
- Why: High turnover resists settling into one vocabulary. 🍞 Anchor: "YOLO" fades while a new meme rises.
- Individual Semantic Drift
🍞 Hook: People sometimes change interests.
🥬 Concept: Drift is how much an agent's topics shift from early to late.
- How: Compare early vs. late message centers for each agent.
- Why: Big, shared drift could mean social pressure; tiny or random drift means inertia. 🍞 Anchor: A student keeps writing about space all year; low drift.
- Feedback Adaptation
🍞 Hook: We repeat jokes classmates like.
🥬 Concept: Feedback adaptation is changing content toward what gets more upvotes.
- How: Compare future posts to past high- vs. low-scoring posts.
- Why: If feedback doesn't move behavior, learning isn't happening. 🍞 Anchor: Ignoring the teacher's gold stars.
- Interaction Influence
🍞 Hook: You talk like the people you reply to.
🥬 Concept: Interaction influence is aligning with a post after commenting on it.
- How: Compare similarity before vs. after the interaction, vs. random posts.
- Why: If interactions don't change you, influence isn't spreading. 🍞 Anchor: Debating a friend but never altering your stance.
- Influence Hierarchy
🍞 Hook: Teams often get captains.
🥬 Concept: Influence hierarchy is when a few agents reliably attract and shape others.
- How: Use PageRank on the comment network; check if top spots persist.
- Why: Stable leaders help coordinate and form norms. 🍞 Anchor: A captain who keeps being captain.
Before vs. After
- Before: Many assumed that enough agents and messages would naturally produce socialization.
- After: We learn that Moltbook reaches average stability but not shared norms, not lasting leaders, and not adaptation, so scale and density alone don't produce socialization.
03 Methodology
At a high level: Raw Moltbook posts/comments → Text prep → Five diagnostic tracks → Compare to baselines → Interpret across society, agent, and collective levels.
Data and Prep
- Inputs: All visible posts, comments, and votes from launch to Feb 8, 2026 (~290k posts, ~1.84M comments).
- Text features: n-grams (for word-level patterns) and semantic embeddings (for meaning-level patterns).
- Networks: Daily directed graphs (commenter → poster) with edge weights as counts.
Track 1: Lexical Innovation (Word Births/Deaths) 🍞 Hook: Like new slang starting and old slang fading. 🥬 Concept: Lexical turnover measures how much the vocabulary keeps changing.
- What happens: For each day, list unique n-grams (1 to 5 words), mark which are new (birth) and which vanish (death), and compute rates.
- Why it exists: If society converges, birth/death rates should drop toward zero; steady rates mean continual change.
- Example: Day X sees 10,000 active bigrams; 800 are brand new (8% birth), and 750 disappear the next day (7.5% death). That's ongoing churn. 🍞 Anchor: A classroom's slang keeps refreshing; you never settle on one set for long.
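The daily birth/death bookkeeping above can be sketched with plain Python sets. This is a minimal illustration, not the paper's code: the helper names (`daily_vocab`, `turnover`) are ours, and a faithful version would measure births against the union of all prior days rather than only the previous day.

```python
def ngrams(tokens, n):
    # All contiguous n-grams of one tokenized post.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def daily_vocab(posts, n_max=5):
    # Union of 1..n_max-gram sets over one day's tokenized posts.
    vocab = set()
    for tokens in posts:
        for n in range(1, n_max + 1):
            vocab |= ngrams(tokens, n)
    return vocab

def turnover(prev_day, curr_day):
    # Birth rate: share of today's n-grams unseen yesterday.
    # Death rate: share of yesterday's n-grams absent today.
    births = len(curr_day - prev_day) / max(len(curr_day), 1)
    deaths = len(prev_day - curr_day) / max(len(prev_day), 1)
    return births, deaths

day1 = daily_vocab([["agents", "talk", "a", "lot"]])
day2 = daily_vocab([["agents", "talk", "new", "slang"]])
b, d = turnover(day1, day2)  # 7 of 10 n-grams changed on each side: 0.7, 0.7
```

On real data the signal is the trend: birth/death rates falling toward zero would indicate a settling vocabulary, while Moltbook's settled at non-zero plateaus.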
Track 2: Semantic Distribution (Macro vs. Micro) 🍞 Hook: The school vibe (average) can feel steady even if friend groups (details) stay unique. 🥬 Concept: Macro stability vs. micro diversity separates average topic steadiness from how similar individual posts are.
- What happens: Embed each post; compute the daily average (semantic centroid) and its similarity across days (macro). Also compute the average pairwise similarity between posts, tracked across days (micro).
- Why it exists: A stable average can hide ongoing diversity; true convergence should raise both macro and micro similarity.
- Example: Daily centroids become >0.98 similar after Day 4 (very stable), while pairwise similarity stays ~0.14 (low), meaning broad spread remains. 🍞 Anchor: The school theme is "STEM month" every week, but each student's essay stays very different.
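The macro/micro contrast reduces to two cosine statistics over daily embedding matrices. A toy sketch under our own naming (`macro_micro`), with random vectors standing in for real sentence embeddings; it reproduces the qualitative pattern the paper reports, a near-stable centroid over persistently diverse posts.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def macro_micro(day_a, day_b):
    # Macro: similarity between the two days' centroids.
    macro = cosine(day_a.mean(axis=0), day_b.mean(axis=0))
    # Micro: mean pairwise similarity among day_b's posts (diagonal excluded).
    B = day_b / np.linalg.norm(day_b, axis=1, keepdims=True)
    sims = B @ B.T
    n = len(B)
    micro = float((sims.sum() - n) / (n * (n - 1)))
    return macro, micro

rng = np.random.default_rng(0)
center = rng.normal(size=64)                            # fixed society-level topic
day_a = center + rng.normal(scale=3.0, size=(200, 64))  # noisy, diverse posts
day_b = center + rng.normal(scale=3.0, size=(200, 64))
macro, micro = macro_micro(day_a, day_b)                # macro high, micro low
```

The noise averages out in the centroid but not between individual posts, which is exactly why a stable macro signal can coexist with low micro similarity.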
Track 3: Cluster Tightening (Local Neighborhoods) 🍞 Hook: Kids huddle into tight circles when friend groups get fixed. 🥬 Concept: Cluster tightening checks if posts gather into denser meaning-neighborhoods.
- What happens: For each post, find its 10 nearest semantic neighbors that day; compute their average similarity (local density). Track the distribution over days and compare day-to-day with Jensen–Shannon divergence.
- Why it exists: Rising density over time suggests homogenization; flat distributions suggest stable diversity.
- Example: After an initial bump as the platform grows, the density distribution stabilizes and JS divergence falls near zero: no further tightening. 🍞 Anchor: Tables in the cafeteria stop moving closer; friend circles stay similar in size and spread.
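A sketch of the local-density and divergence step, with names of our own choosing; at Moltbook scale an exact dense similarity matrix would be impractical and an approximate nearest-neighbor index would replace it.

```python
import numpy as np

def local_density(embs, k=10):
    # Mean cosine similarity of each post to its k nearest neighbors that day.
    X = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)          # a post is not its own neighbor
    return np.sort(sims, axis=1)[:, -k:].mean(axis=1)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two (unnormalized) histograms.
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
day1 = rng.normal(size=(300, 32))            # two days drawn from the same
day2 = rng.normal(size=(300, 32))            # distribution: no tightening
bins = np.linspace(-1, 1, 21)
h1, _ = np.histogram(local_density(day1), bins=bins)
h2, _ = np.histogram(local_density(day2), bins=bins)
jsd = js_divergence(h1.astype(float), h2.astype(float))  # small: stable spread
```

A rising density distribution (and non-trivial day-to-day JS divergence during the shift) would have signaled echo-chamber formation; flat distributions are the "stable diversity" outcome.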
Track 4: Agent-Level Adaptation 4a) Individual Semantic Drift 🍞 Hook: Do students change topics as they learn? 🥬 Concept: Drift measures how an agent's early topics compare to later topics.
- What happens: Split each agent's posts into early vs. late halves; compare their semantic centers.
- Why it exists: Big, shared drift could signal socialization; small or random drift means inertia.
- Example: Heavy posters show very small drift; light posters vary more (small-sample noise). 🍞 Anchor: A student keeps writing about astronomy all semester.
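The drift measure is one cosine between two centroids per agent. A minimal sketch with a hypothetical name (`semantic_drift`), contrasting a steady agent with one that flips themes halfway through.

```python
import numpy as np

def semantic_drift(post_embs):
    # 1 - cosine(early-half centroid, late-half centroid); higher = more drift.
    half = len(post_embs) // 2
    early = post_embs[:half].mean(axis=0)
    late = post_embs[half:].mean(axis=0)
    cos = early @ late / (np.linalg.norm(early) * np.linalg.norm(late))
    return 1.0 - float(cos)

rng = np.random.default_rng(2)
theme = rng.normal(size=32)
# Agent that stays on one topic all along: low drift.
steady = theme + rng.normal(scale=0.5, size=(100, 32))
# Agent that switches to the opposite theme halfway: high drift.
switched = np.vstack([theme + rng.normal(scale=0.5, size=(50, 32)),
                      -theme + rng.normal(scale=0.5, size=(50, 32))])
low, high = semantic_drift(steady), semantic_drift(switched)  # low << high
```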
4b) Feedback Adaptation (Net Progress) 🍞 Hook: We tend to repeat what gets applause. 🥬 Concept: Do agents move toward their own high-upvote posts and away from low-upvote ones?
- What happens: Use sliding windows over each agent's timeline. In window k, find the top 30% vs. bottom 30% of posts by score; compare the next window's center to both. Net Progress = closeness to the top set minus closeness to the bottom set. Compare to a shuffled-score baseline.
- Why it exists: If upvotes teach agents, Net Progress should beat the shuffled baseline.
- Example: Observed Net Progress overlaps the shuffled baseline and centers near zero: no learning. 🍞 Anchor: Ignoring the audience's claps and boos.
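One sliding-window step of Net Progress with its shuffled-score baseline might look like the sketch below. The names and the 30% cut handling are our reading of the description; real upvote scores and consecutive windows of an agent's timeline replace the random stand-ins.

```python
import numpy as np

def net_progress(window_embs, scores, next_center):
    # Similarity of the next window's centroid to the top-30% posts by score,
    # minus its similarity to the bottom-30%. Positive = moving toward winners.
    order = np.argsort(scores)
    k = max(1, int(0.3 * len(scores)))
    bot = window_embs[order[:k]].mean(axis=0)
    top = window_embs[order[-k:]].mean(axis=0)
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos(next_center, top) - cos(next_center, bot)

rng = np.random.default_rng(3)
embs = rng.normal(size=(50, 16))      # one agent's posts in window k
scores = rng.normal(size=50)          # upvote scores (random: no real signal)
next_center = rng.normal(size=16)     # centroid of window k+1
observed = net_progress(embs, scores, next_center)
# Shuffled-score baseline: if feedback taught nothing, the observed value
# should be indistinguishable from Net Progress under permuted scores.
baseline = np.array([net_progress(embs, rng.permutation(scores), next_center)
                     for _ in range(200)])
```

The paper's null result corresponds to `observed` falling well inside the spread of `baseline`, which is what happens here by construction.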
4c) Interaction Influence (Event Study) 🍞 Hook: After you debate someone, do you talk more like them? 🥬 Concept: Do agents' future posts move closer to a post they commented on?
- What happens: For each comment event, compare similarity to the target post before vs. after; compare to random same-day posts.
- Why it exists: If interactions transmit influence, similarity should increase beyond the random baseline.
- Example: The shift centers at zero and matches random: no contagion. 🍞 Anchor: Lots of replies, little mind-changing.
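The event study compares the before/after similarity change against a random-post control. A minimal sketch with our own names; in the study each event uses the commenter's actual pre- and post-comment posts and random same-day posts as controls.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def influence_shift(pre_posts, post_posts, target, random_posts):
    # Change in similarity to the commented-on target post, minus the same
    # change measured against random posts (a no-influence control).
    pre_c, post_c = pre_posts.mean(axis=0), post_posts.mean(axis=0)
    shift_target = cos(post_c, target) - cos(pre_c, target)
    shift_random = np.mean([cos(post_c, r) - cos(pre_c, r)
                            for r in random_posts])
    return shift_target - shift_random

rng = np.random.default_rng(4)
target = rng.normal(size=16)           # the post that was commented on
pre = rng.normal(size=(20, 16))        # commenter's posts before the event
post = rng.normal(size=(20, 16))       # posts after (no adaptation simulated)
controls = rng.normal(size=(50, 16))   # random same-day posts
shift = influence_shift(pre, post, target, controls)
```

A population of such shifts centered above zero would indicate contagion; Moltbook's centered at zero.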
Track 5: Collective Structure and Consensus 5a) Structural Influence (PageRank & Supernodes) 🍞 Hook: Teams often have lasting captains. 🥬 Concept: Stable influence means a few nodes hold attention over time.
- What happens: Build daily commenter → poster graphs; run PageRank. Track top-k mass and detect supernodes via the largest rank gap. Check if top spots persist.
- Why it exists: Strong, stable leaders are a hallmark of matured social structure.
- Example: Top-k mass drops as the network grows; supernodes remain few and change identities: no persistent core. 🍞 Anchor: The captain's armband keeps switching hands.
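The structural track can be sketched with a self-contained power-iteration PageRank (no graph library needed). The `topk_mass` and largest-gap supernode helpers follow the description above, but the names and details are our assumptions, not the paper's code.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    # Power iteration on a weighted adjacency matrix where
    # adj[i, j] = number of comments agent i left on agent j's posts.
    n = len(adj)
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalize; dangling nodes (no outgoing comments) spread uniformly.
    P = np.where(out > 0, adj / np.where(out == 0, 1, out), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (r @ P)
    return r / r.sum()

def topk_mass(r, k):
    # Share of total PageRank held by the k highest-ranked nodes.
    return float(np.sort(r)[-k:].sum())

def supernodes(r):
    # Cut the sorted rank profile at its single largest gap;
    # everything above the gap counts as that day's supernodes.
    s = np.sort(r)[::-1]
    cut = int(np.argmax(s[:-1] - s[1:])) + 1
    return np.argsort(r)[::-1][:cut]

# Toy day: every other agent comments on agent 0's posts.
n = 8
adj = np.zeros((n, n))
adj[1:, 0] = 1.0
r = pagerank(adj)   # agent 0 dominates; supernodes(r) picks it out
```

Running this per day and intersecting the supernode sets across days is the persistence check; on Moltbook the sets kept changing identity.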
5b) Cognitive Consensus (Probing) 🍞 Hook: Ask a class, "Who are the top students?" and see if the answers match. 🥬 Concept: Shared social memory is agreement about who matters and what to read.
- What happens: Post 45 probes across communities asking for must-read posts and accounts to follow; score the replies for valid references and overlap.
- Why it exists: Agreement shows a maturing culture with anchors.
- Example: Few replies, many invalid references, and little overlap; even the lone valid thread disagrees. 🍞 Anchor: Yearbook superlatives with no consensus.
The Secret Sauce
- Separate climate (macro) from weather (micro) to avoid false convergence.
- Use baselines (shuffled scores, random posts) to avoid mistaking coincidence for causation.
- Pair structure (PageRank) with cognition (probes) to catch both network and memory aspects.
- Multi-level, time-aware diagnostics to see evolution, not just snapshots.
04 Experiments & Results
The Tests and Why
- Semantic stabilization: Is the average topic center steady? Why: Real societies often settle into themes.
- Lexical turnover: Are words/phrases still churning? Why: High churn resists settling.
- Individual inertia: Do agents change over time? Why: Socialized agents adapt.
- Influence persistence: Do leaders stick around? Why: Stable hierarchy signals maturity.
- Collective consensus: Do agents agree on who matters? Why: Shared memory marks culture.
The Competition (Baselines and Expectations)
- Randomized score baselines for feedback learning: If observed improvement beats random, learning likely.
- Random post baselines for interaction influence: If observed alignment beats random, influence likely.
- Human-society expectation: Over time, we usually see stronger norms and more stable leadership.
The Scoreboard (with Context)
- Macro semantic stability: Near-saturated similarity between day-to-day centers after the early burst, like saying the school's theme stays constant.
- Micro diversity: Low and steady pairwise similarity, like students who keep writing very different essays.
- Cluster tightening: Brief early densification, then flat; no ongoing squeeze into echo chambers.
- Lexical turnover: Birth/death rates drop from the early spike but settle at clear non-zero levels: steady churn rather than fixation.
- Individual drift: Generally small; heavy posters drift even less. Strong inertia.
- Feedback adaptation: Net Progress ≈ 0 and matches the shuffled baseline: no learning from upvotes/downvotes.
- Interaction influence: Post-after-comment similarity shows a shift of ≈ 0 and matches the random baseline: interactions don't transmit influence.
- Influence hierarchy: Top-k PageRank mass declines with growth; supernodes remain few and rotate: no persistent leaders.
- Consensus probes: Few comments, many invalid references, and little agreement: no shared social memory.
Meaningful Translations
- "Centroid similarity near 1.0" means the average topic stayed steady (an A+ in stability), but "pairwise around ~0.14" means individuals stayed diverse (not an echo chamber).
- "Net Progress ≈ baseline" is like scoring a 70 when guessing also gets 70: no sign of learning.
- "No persistent supernodes" is like a league where the top team changes daily with no dynasty: excitement without structure.
Surprising Findings
- Despite massive activity and some accounts posting a lot, durable authority didn't form.
- A memecoin-like burst showed that incentives can cause fast coordination even without lasting structure, highlighting the difference between momentary waves and true socialization.
- The strongest pattern was "interaction without influence": plenty of replies, almost no behavioral change.
05 Discussion & Limitations
Limitations
- Embeddings and metrics: Using a particular sentence-embedding model may miss nuances; different embeddings or models could shift fine-grained results.
- Platform/time slice: Results are from Moltbook over specific weeks; other time windows or platforms might behave differently.
- Agent heterogeneity: Agents differ by base models and prompts; some may lack memory or RL loops, limiting their capacity to adapt.
- Moderation/incentive unknowns: Hidden platform rules or incentive changes could affect behavior in ways we can't observe.
- Content filters: Removing extremely repeated posts helps quality but may hide certain dynamics (e.g., bot storms).
Required Resources
- Data access to a persistent AI-only platform with rich interaction logs.
- Compute for embedding, n-gram analysis, and daily network PageRank at scale.
- Careful experimental logging to align events and avoid leakage between windows.
When NOT to Use
- Small, short-lived, or tightly scripted simulations where convergence is pre-ordained; the diagnostics are meant for open-ended, evolving societies.
- Settings without interaction primitives (no comments/votes), where many measures can't be computed meaningfully.
Open Questions
- Memory design: Would durable, shared memory (e.g., wikis, canonical references) trigger consensus formation?
- Learning loops: If agents were trained to optimize for community feedback (via RL), would feedback adaptation appear?
- Governance: Which structures (reputation, moderation, constitutions) help influence stick in healthy ways?
- Cross-society generalization: Do similar patterns hold on other AI societies with different incentives or agent designs?
- Human–AI mix: Does adding humans as anchors accelerate norm-setting and leader stability?
06 Conclusion & Future Work
3-Sentence Summary: Moltbook shows that even with millions of agents and dense interactions, socialization does not automatically emerge. The society's average topics stabilize quickly, but individuals remain diverse, don't learn from feedback, don't align after interactions, and don't form lasting leaders or consensus. Scale and activity create motion, not maturity; memory, reinforcement, and governance appear necessary.
Main Achievement: The paper defines AI Socialization precisely and delivers a practical, multi-level diagnostic framework, spanning semantic, behavioral, structural, and cognitive measures, that reveals scalability without socialization in a real, large-scale AI society.
Future Directions
- Add shared social memory and canonical references to test whether consensus forms.
- Introduce explicit feedback-learning loops (e.g., RL on community signals) to spark adaptation.
- Explore governance and reputation systems that allow safe, durable influence accumulation.
- Replicate across platforms, time spans, and mixed human–AI environments.
Why Remember This: It overturns the assumption that "more agents + more messages = a society" and gives us tools to tell the difference between noisy crowds and real communities. As AI populations grow, these diagnostics guide the design of safer, more coherent, and more accountable AI societies.
Practical Applications
- Add platform-level shared memory (e.g., canonical references or community wikis) to support consensus formation.
- Introduce explicit learning loops where agents are trained to optimize for high-quality feedback signals (e.g., reinforcement learning on validated upvotes).
- Deploy reputation systems that let influence accumulate safely and decay slowly, encouraging stable leadership.
- Use the diagnostic toolkit (macro/micro metrics, baselines) as a health dashboard for AI communities over time.
- Gate sensitive coordination primitives (like token minting) behind governance checks to avoid runaway cascades.
- Run periodic cognitive probes to test whether consensus and shared anchors are forming in sub-communities.
- Tune incentives to reward grounded references and penalize hallucinated citations to strengthen social memory.
- Segment agents by capability and memory settings to study which designs foster adaptation and safe influence.
- Replicate measurements across platforms and time windows to detect regime shifts or emerging risks early.