
SVBench: Evaluation of Video Generation Models on Social Reasoning

Beginner
Wenshuo Peng, Gongxuan Wang, Tianmeng Yang et al. · 12/25/2025
arXiv · PDF

Key Summary

  • SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
  • It is built from 30 classic psychology experiments covering seven skills like understanding goals, following gaze, and helping others.
  • A four-agent, training-free pipeline turns each experiment into fair, video-ready prompts with easy/medium/hard versions.
  • A vision–language model judge scores each generated video on five clear yes/no questions about social logic and visual faithfulness.
  • Across eight models, top proprietary systems (like Sora2-Pro and Veo-3.1) did much better than others, but still failed on belief-based and multi-agent reasoning.
  • The critic agent removes hint-giving words and adjusts social cues, making prompts fair and controllable in difficulty.
  • Human studies show the automated judge’s scores line up with people’s judgments, especially on what counts as correct social reasoning.
  • The benchmark reveals a big gap: models can simulate motion and scenes well but often miss why people act the way they do.
  • SVBench helps researchers diagnose specific weaknesses (like joint attention or prosocial inference) instead of only measuring visual quality.
  • This work pushes video generation toward more human-like understanding by testing the ‘why’ behind actions, not just the ‘what’ on screen.

Why This Research Matters

AI that only paints pretty motion can still feel rude, unsafe, or confusing. SVBench pushes video generation to understand why people act—so digital characters, robots, and assistants behave in human-friendly ways. This helps studios make believable stories, game worlds feel socially real, and educational tools “read the room.” It also supports safer interactions in public spaces, where spacing, turn-taking, and helping matter. By revealing specific weaknesses (like missing gaze-following), teams can fix gaps faster. As models improve on SVBench, we move closer to AI that respects social rules and supports people in everyday life.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how when you watch a cartoon, you can instantly tell who is sad, who needs help, and who is following someone’s point? Your brain understands the story behind the actions, not just the colors and movement.

đŸ„Ź The Concept: Social reasoning is the skill of figuring out why people act—what they want, what they think, and what feelings and rules guide them.

  • How it works: 1) Notice social cues (like gaze, gestures, and distance) 2) Guess goals and beliefs 3) Predict the next action that fits the situation 4) Check if actions follow social rules (like taking turns)
  • Why it matters: Without social reasoning, a scene might look real but feel wrong—characters won’t help when they should, won’t look where others point, and won’t follow basic manners.

🍞 Anchor: Imagine a girl crying near a dropped ice cream. Most people expect a nearby adult to comfort her. That’s social reasoning at work.

🍞 Hook: Imagine a world where computers can draw believable videos—smooth motion, pretty lighting—but don’t understand the story.

đŸ„Ź The Concept: Before this paper, video generation benchmarks mostly tested looks and physics (Is it sharp? Is motion smooth? Does it obey gravity?).

  • How it works: Tools like FVD, VBench, and physics suites scored visual quality, consistency, and simple commonsense, often with automated checks.
  • Why it matters: These tests are great for “how it looks,” but they miss “why it happens,” so models could score well while failing social logic.

🍞 Anchor: A model could make a perfect scene of two people at a table but fail to show one passing the salt after the other points to it—socially odd, visually great.

🍞 Hook: Think about reading a friend’s mind just from eye direction or a tiny gesture.

đŸ„Ź The Concept: Developmental psychology paradigms are classic experiments that reveal how humans learn social skills like following gaze, helping, or taking turns.

  • How it works: 1) Set up a simple scene with clear cues 2) See if people, even toddlers, make the expected social move 3) Use results to define core abilities
  • Why it matters: These paradigms are proven, interpretable tests we can borrow to check if AI shows similar social logic in videos.

🍞 Anchor: In “pointing comprehension,” a person points at a box; even infants look where the finger leads—proof that pointing carries social meaning.

🍞 Hook: Picture two checklists: one for appearance (bright colors, stable frames) and one for behavior (sharing, cooperating, attending). Until now, most AI video tests used only the first list.

đŸ„Ź The Concept: The problem this paper tackles is the missing second list: there was no standard way to see if generated videos show believable social behavior.

  • How it works: The authors design a social benchmark firmly tied to psychology, turn experiments into prompts, and grade outputs along distinct social dimensions.
  • Why it matters: Without this, we couldn’t tell whether a model truly understands social cues versus just animating humans moving around.

🍞 Anchor: A model that makes a crowd scene with perfect lighting but has everyone cutting the line is failing social sense, even if it aces visual checks.

🍞 Hook: Imagine wanting to teach a robot to be polite, helpful, and aware of others’ views, not just to walk smoothly.

đŸ„Ź The Concept: SVBench fills the gap by evaluating seven core social skills: mental-state inference, goal-directed action, joint attention/perspective, social coordination, emotion/prosocial behavior, social norms/spacing, and multi-agent strategy.

  • How it works: 1) Choose classic experiments for each skill 2) Turn them into neutral, short, video-ready prompts 3) Score generated clips with a structured judge
  • Why it matters: Now we can see exactly which social skills models have or lack—and track progress fairly.

🍞 Anchor: If a model passes “turn-taking” but fails “gaze following,” we know where to improve training data or model design.

🍞 Hook: Think of trying to grade an art project for teamwork without a fair rubric—it’s guesswork.

đŸ„Ź The Concept: Previous attempts to measure higher-level understanding in video focused on recognition (answering questions about human-made clips), not generation (creating social scenes from scratch).

  • How it works: Recognition tasks test whether models can describe what they see; this work tests whether models can create socially coherent interactions on their own.
  • Why it matters: Generation is harder: the model must choose actions that fit social logic, not just label them.

🍞 Anchor: It’s easier to say “They’re lining up” than to make a new video where everyone lines up naturally without being told how.

Together, the background shows the world before (great visuals, weak social sense), the problem (no benchmark for the ‘why’), failed attempts (visual-only metrics, recognition-only tests), and the gap SVBench fills (psychology-grounded, generation-focused social evaluation). The real stakes touch everyday life: safer robots and avatars, more believable movies and games, better tutoring systems that “read the room,” and AI that respects social norms.

02 Core Idea

🍞 Hook: You know how a good referee doesn’t just watch the ball—they watch the players to see if they’re playing fairly and following the game’s logic?

đŸ„Ź The Concept: The key insight is to evaluate video models on social reasoning by converting classic psychology experiments into neutral, short, and cue-controlled prompts, then grading the results with a structured vision–language judge.

  • How it works: 1) Distill each experiment’s social logic 2) Generate multiple concrete scenarios without revealing the “answer” 3) Adjust cues to make easy/medium/hard versions 4) Score with five yes/no checks about social correctness and visual faithfulness
  • Why it matters: This shifts evaluation from “Did it look nice?” to “Did the social story make sense?”, revealing strengths and weaknesses that visuals alone hide.

🍞 Anchor: Like a science fair rubric, the judge checks if the method (social logic) matches the hypothesis (experiment idea)—not just whether the poster looks pretty.

Multiple analogies:

  • 🍞 Hook: Imagine baking cookies. Good cookies need more than a shiny tray—they need the right recipe steps in the right order. đŸ„Ź The Concept: SVBench checks if models follow the social recipe (gaze leads attention, needing help invites helping) rather than just arranging pretty ingredients (people, props). 🍞 Anchor: If the prompt implies pointing should guide attention but the video shows a chat instead, the “recipe” wasn’t followed.
  • 🍞 Hook: Think of a driving test. It’s not about the car’s paint—it’s about signaling, yielding, and reading others’ intentions. đŸ„Ź The Concept: This benchmark is a social driving test for videos: Did the agent look before moving? Did it take turns? Did it keep safe distance? 🍞 Anchor: A smooth video of lane changes still fails if the “driver” never signals.
  • 🍞 Hook: Consider a magic trick: Noticing where the magician looks gives away what will happen next. đŸ„Ź The Concept: Social cues (gaze, pointing, posture, spacing) are the hidden strings; SVBench tests whether models can use them to pull the right behavior. 🍞 Anchor: When someone points behind you, you turn; a model that ignores pointing misses the trick.

Before vs After:

  • 🍞 Hook: Picture grading only spelling in an essay and ignoring whether the story makes sense. đŸ„Ź The Concept: Before SVBench, models could pass with perfect “spelling” (visuals) even if their “story” (social logic) was nonsense; after SVBench, both are graded separately. 🍞 Anchor: Now a crisp video that violates queuing or ignores helping cues won’t slip by.

Why it works (intuition):

  • 🍞 Hook: You know how board games are easiest to judge when rules are clear and moves are discrete? đŸ„Ź The Concept: Using classic, well-defined experiments turns fuzzy social expectations into clear rules; binary questions (yes/no) remove scoring wobble. 🍞 Anchor: “Did gaze-following happen?” is clearer than a 0–100 “socialness” score.

Building blocks:

  • 🍞 Hook: Imagine assembling a team: planner, writer, editor, and judge. đŸ„Ź The Concept: Four agents do the job—Experiment Understanding Agent (extracts core logic), Prompt Synthesis Agent (writes neutral, concrete scenes), Critic Agent (removes leaks and tunes difficulty via cues), and Evaluation Agent (VLM judge with five binary checks: core paradigm, prompt faithfulness, social coherence, cue effectiveness, and general plausibility). 🍞 Anchor: For a helping test, the planner defines “unmet goal invites helping,” the writer describes reaching and failing, the critic hides the answer and sets cue levels, and the judge checks if helping actually appears.

Altogether, the “aha!” is turning trusted human social-science tests into fair, bite-sized video tasks with controllable cues and a stable, interpretable scoring scheme—so we finally measure the ‘why’ behind actions, not just the ‘what’ on screen.

03 Methodology

High-level flow: Input (psychology experiments) → Experiment Understanding Agent → Prompt Synthesis Agent → Critic Agent (neutrality + difficulty) → Video generation by models → Evaluation Agent (VLM judge) → Five binary scores and an overall average.
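To make this flow concrete, here is a minimal Python sketch of the four-agent loop. It is an illustration under stated assumptions: every name in it (ExperimentBlueprint, VideoPrompt, run_svbench_pipeline, the agent callables) is a hypothetical stand-in, since the actual pipeline orchestrates prompted LLM/VLM agents rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Video = Any  # placeholder for whatever clip object a text-to-video model returns


@dataclass
class ExperimentBlueprint:
    """Structured output of the Experiment Understanding Agent (hypothetical fields)."""
    description: str         # what the classic experiment looks like on screen
    key_concepts: List[str]  # e.g., ["pointing", "joint attention"]
    test_point: str          # the social skill being probed
    ground_truth: str        # expected behavior; kept out of the prompts themselves


@dataclass
class VideoPrompt:
    """A neutral, action-only prompt at one difficulty level."""
    text: str        # only camera-visible actions, no mind-reading words
    difficulty: str  # "easy" | "medium" | "hard"


def run_svbench_pipeline(
    experiment_text: str,
    understanding_agent: Callable[[str], ExperimentBlueprint],
    synthesis_agent: Callable[[ExperimentBlueprint], List[VideoPrompt]],
    critic_agent: Callable[[ExperimentBlueprint, List[VideoPrompt]], List[VideoPrompt]],
    video_model: Callable[[str], Video],
    judge_agent: Callable[[Video, VideoPrompt, ExperimentBlueprint], Dict[str, int]],
) -> List[Dict[str, Any]]:
    """Sketch of the training-free, four-agent evaluation loop."""
    blueprint = understanding_agent(experiment_text)   # extract the social logic
    drafts = synthesis_agent(blueprint)                # write video-ready scenes
    prompts = critic_agent(blueprint, drafts)          # strip leaks, set difficulty
    results = []
    for prompt in prompts:
        clip = video_model(prompt.text)                # generate a 5-10 s clip
        scores = judge_agent(clip, prompt, blueprint)  # five binary checks, D1-D5
        results.append({"prompt": prompt, "scores": scores})
    return results
```

The point of the sketch is the data flow: the ground truth stays inside the blueprint and never appears in the prompts the video model sees.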

  1. Seed Suite and Task Selection 🍞 Hook: Imagine picking mini-games that each teach one key skill. đŸ„Ź The Concept: The authors start with 30 classic social reasoning experiments covering seven dimensions (mental states; goals; joint attention/perspective; coordination; emotion/prosocial; norms/spacing; multi-agent strategy), then select 15 that fit 5–10 second clips.
  • How it works: 1) Map each paradigm to its social skill 2) Decide if the logic can be shown in a short scene with visible cues 3) Keep short-form tasks now, save long-horizon ones for future models
  • Why it matters: If a task needs many stages (like deception across scenes), today’s short videos can’t show it clearly, causing unfair failures. 🍞 Anchor: “Detour reaching” (walk around an obstacle) fits a short clip; “false belief” (multi-scene belief change) doesn’t.
  2. Experiment Understanding Agent 🍞 Hook: Think of a science teacher who extracts the main idea from a long story. đŸ„Ź The Concept: This agent turns each experiment into a structured blueprint: description, key concepts, test point (what skill is tested), and ground truth (expected behavior).
  • How it works: 1) Read experiment 2) Identify the social mechanism (e.g., gaze→attention shift) 3) Define what success looks like behaviorally 4) Record constraints
  • Why it matters: Without a clear blueprint, later prompts might drift and accidentally include the answer or miss the real skill. 🍞 Anchor: For pointing comprehension, it notes: “Point indicates target; follower should orient and act toward indicated object.”
  3. Prompt Synthesis Agent 🍞 Hook: Imagine a scriptwriter who makes scenes you can film right now. đŸ„Ź The Concept: This agent creates video-ready, action-only prompts—no mind-reading words, only what a camera can see.
  • How it works: 1) Describe only visible actions (reaching, looking, walking) 2) Keep scenes 5–10s long 3) Pick concrete agents (e.g., adult, toddler) and objects (box, pen) 4) Separate scene description from outcomes
  • Why it matters: If prompts say “she decides to help,” the model is told the answer; we need neutral, observable setups to test real reasoning. 🍞 Anchor: “An adult drops a clothespin, stretches unsuccessfully, glances at it, and points; a toddler stands nearby”—no hint that helping must happen.
  4. Critic Agent: Neutrality and Difficulty Control 🍞 Hook: Think of a careful editor who removes spoilers and sets the challenge level. đŸ„Ź The Concept: The critic deletes interpretive language, blocks ground-truth leaks, and dials cues up or down for easy/medium/hard.
  • How it works: 1) Replace “realizes/decides/feels” with purely visible behavior 2) Compare prompt to the test point; if it reveals the answer, send edits back 3) Tweak cues—gaze angle, pointing clarity, occlusion—to change difficulty
  • Why it matters: Without this, prompts might be unfair (giving away answers) or too easy/hard, hiding true model ability. 🍞 Anchor: Easy: clear pointing + gaze + object in view; Medium: just pointing; Hard: subtle gaze, partial occlusion, or competing distractors.
  5. Video Generation by Models 🍞 Hook: It’s showtime—different directors film the same script. đŸ„Ź The Concept: Eight text-to-video models (proprietary and open-source) generate clips from the validated prompts.
  • How it works: Each model takes the same prompt pool, producing multiple short videos per task and difficulty.
  • Why it matters: A fair, apples-to-apples comparison reveals which social skills each model actually shows on screen. 🍞 Anchor: All models get the same “queue at a bus stop” prompt; some show polite spacing and order, others jumble people.
  6. Evaluation Agent (VLM Judge) and the Five Dimensions 🍞 Hook: Imagine a referee with five simple yes/no questions per play. đŸ„Ź The Concept: A high-capacity vision–language model (Gemini 2.5 Pro) scores each video along five binary dimensions (a minimal scoring sketch follows this list):
  • D1 Core Paradigm Replication: Did the intended psychological effect appear?
  • D2 Prompt Faithfulness: Did the scene match the agents/objects/actions in the prompt?
  • D3 Social Coherence: Was the behavior causally and socially plausible?
  • D4 Social Cue Effectiveness: Were critical cues (gaze, gestures, spacing) rendered and used?
  • D5 Video Plausibility: Did the clip look stable and realistic enough to judge?
  • How it works: The judge reads minimal metadata and answers each D1–D5 as 0/1; the overall score is their average.
  • Why it matters: Binary checks cut noise, separate reasoning errors from rendering errors, and make failure modes interpretable. 🍞 Anchor: In a gaze-following prompt, a pretty but chatting-only clip gets D5=1, but D1–D4=0: it looks fine but fails the social test.
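A minimal sketch of the scoring arithmetic, assuming the judge’s five answers arrive as a dictionary of 0/1 values. The dimension keys below are paraphrases of D1–D5; in the benchmark the answers come from a prompted Gemini 2.5 Pro call, not from this hand-rolled function.

```python
# Dimension keys paraphrase D1-D5; the real answers come from the VLM judge.
DIMENSIONS = [
    "core_paradigm",        # D1: did the intended psychological effect appear?
    "prompt_faithfulness",  # D2: do agents/objects/actions match the prompt?
    "social_coherence",     # D3: is the behavior causally and socially plausible?
    "cue_effectiveness",    # D4: were gaze/gesture/spacing cues rendered and used?
    "video_plausibility",   # D5: is the clip stable and realistic enough to judge?
]


def score_clip(judge_answers: dict) -> dict:
    """Average the five binary answers into an overall per-clip score."""
    per_dim = {d: int(judge_answers[d]) for d in DIMENSIONS}
    per_dim["overall"] = sum(per_dim[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return per_dim


# The gaze-following example above: visually fine, socially wrong.
print(score_clip({
    "core_paradigm": 0, "prompt_faithfulness": 0, "social_coherence": 0,
    "cue_effectiveness": 0, "video_plausibility": 1,
}))  # -> all social checks 0, video_plausibility 1, overall 0.2
```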

The Secret Sauce 🍞 Hook: You know how a good lab experiment isolates one variable at a time? đŸ„Ź The Concept: Cue-controlled prompts (easy→hard) plus binary, dimension-wise scoring isolate social reasoning from surface quality and reveal which cues models truly use.

  • How it works: More/less gaze, pointing, and context systematically change task difficulty; separate D’s show where things break (a toy cue ladder is sketched after this list).
  • Why it matters: This turns a fuzzy problem (“be social!”) into crisp diagnostics (“fails D1 on joint attention when gaze is subtle”). 🍞 Anchor: A top model keeps passing even when pointing is removed (hard), proving it reads subtle gaze; a weaker model only passes the easy, over-cued version.
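For intuition about what the critic agent controls, here is a hypothetical cue ladder for a pointing-comprehension task, loosely mirroring the easy/medium/hard descriptions above; the cue names and settings are illustrative assumptions, not the benchmark’s actual configuration.

```python
# Hypothetical cue settings: the critic raises or lowers difficulty by
# adding or removing observable cues like these in the prompt text.
CUE_LADDER = {
    "easy":   {"pointing": "clear", "gaze": "clear",  "target": "in full view"},
    "medium": {"pointing": "clear", "gaze": "none",   "target": "in full view"},
    "hard":   {"pointing": "none",  "gaze": "subtle", "target": "partially occluded"},
}


def summarize(level: str) -> str:
    """Describe which cues a prompt at this difficulty level would include."""
    cues = CUE_LADDER[level]
    return f"{level}: pointing={cues['pointing']}, gaze={cues['gaze']}, target={cues['target']}"


for level in CUE_LADDER:
    print(summarize(level))
```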

04 Experiments & Results

🍞 Hook: Think of a school tournament with seven events—each testing a different social skill—and a clear scoreboard.

đŸ„Ź The Concept: The authors evaluate eight video generation models on 15 short-form tasks (from a 30-experiment suite), across seven social dimensions, and score each clip with five binary checks.

  • How it works: 1) For each task and difficulty level, models generate videos 2) The VLM judge rates D1–D5 3) Scores aggregate by task and dimension 4) Human spot-checks validate alignment with the automated judge (a toy aggregation sketch follows the anchor example below)
  • Why it matters: This setup measures whether the right social story appears, not just whether frames look good.

🍞 Anchor: In “pointing comprehension,” top models often show the follower turning to the pointed object, earning high D1 and D4.
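As a toy illustration of the aggregation step, the sketch below averages per-clip judge scores by any grouping key (task, model, or difficulty). The record format and the numbers are invented for illustration; they are not data from the paper.

```python
from collections import defaultdict

# Invented records: one entry per judged clip (values are not from the paper).
records = [
    {"model": "ModelA", "task": "pointing_comprehension", "difficulty": "easy",
     "scores": {"D1": 1, "D2": 1, "D3": 1, "D4": 1, "D5": 1}},
    {"model": "ModelA", "task": "pointing_comprehension", "difficulty": "hard",
     "scores": {"D1": 0, "D2": 1, "D3": 1, "D4": 0, "D5": 1}},
]


def aggregate(records: list, key: str) -> dict:
    """Average each dimension (plus an overall mean) over clips grouped by `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["scores"])
    summary = {}
    for name, score_list in groups.items():
        dims = {d: sum(s[d] for s in score_list) / len(score_list) for d in score_list[0]}
        dims["overall"] = sum(dims.values()) / len(dims)
        summary[name] = dims
    return summary


print(aggregate(records, key="task"))
# -> {'pointing_comprehension': {'D1': 0.5, 'D2': 1.0, 'D3': 1.0, 'D4': 0.5, 'D5': 1.0, 'overall': 0.8}}
```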

The Competition and Models

  • Proprietary: Sora2-Pro, Veo-3.1, Hailuo02-S, Kling2.5-Turbo
  • Open-source: HunyuanVideo, Longcat-Video, LTX-1.0, Wan2.2

The Scoreboard (with context)

  • Overall performance (higher is better): Sora2-Pro ≈ 79.6% (A-level), Veo-3.1 ≈ 72.4% (strong B+/A−), Hailuo02-S ≈ 56.4% (C), Kling2.5-Turbo ≈ 52.2% (C−), Wan2.2 ≈ 48.3% (D+), Longcat-Video ≈ 39.2% (D), HunyuanVideo ≈ 30.8% (F+), LTX-1.0 ≈ 27.6% (F).
  • By task highlights: Sora2-Pro and Veo-3.1 exceed 80% on many goal, joint attention, and prosocial tasks (e.g., pointing comprehension, joint engagement, empathic concern). Mid-tier proprietary models lag on multi-agent coordination (leader–follower) and perspective-based helping. Open-source models struggle broadly, especially on complex causal/belief tasks.
  • Interpreting 79.6%: That’s like getting an A on social logic when many peers hover around a C—still not perfect, but clearly more socially competent.

Surprising Findings

  • Difficulty reversal: For top models (Sora2-Pro, Veo-3.1), performance sometimes peaks at medium or hard, not easy. Why? Extra cues in easy prompts can conflict within a short 5–10s clip, hurting faithfulness or cue effectiveness; stronger models infer intent even with sparse cues.
  • Cue dependence: Weaker models (Hailuo02-S, Kling2.5-Turbo, and open-source group) benefit most from explicit cues—scores follow Easy > Medium > Hard, showing reliance on surface signals.

Human Alignment and Reliability

  • Human vs VLM judge: On a 160-clip sample (20 per model), trends match closely. Humans are more lenient on visual dimensions (D2/D4/D5) but stricter on reasoning (D1/D3). This suggests the automated judge is a solid proxy, especially for detecting logical correctness.

Pipeline Validation

  • Prompt quality improves with each agent stage: Average human pass rates jump from ≈66.8% (no understanding) → 75.9% (understanding + synthesis) → 86.9% (full with critic). This shows the planning and editing agents are essential for clean, fair prompts.
  • Difficulty control works: For mid/weak models, scores step down from easy to hard, confirming cue-based modulation raises real challenge; for top models, robustness under sparse cues stands out, revealing deeper social inference ability.

Dimension-wise Patterns

  • Strong areas: Pointing comprehension, joint engagement, empathic concern, and emotion contagion show higher pass rates for top systems—these often rely on visible, well-known cues.
  • Weak areas: Intention recognition without overt cues, belief/perspective-based helping, leader–follower coordination, and consistent spacing/norms under crowd interactions—these require integrating subtle signals over time.

Overall, the results paint a clear picture: modern generators can animate plausible humans, but only a few begin to capture why those humans act as they do—and even those stumble when beliefs, subtle cues, or multi-agent strategies are required.

05 Discussion & Limitations

🍞 Hook: Picture teaching a friend to play a team sport: they can run fast, but reading teammates and opponents is still hard.

đŸ„Ź The Concept: SVBench shows progress but also hard limits in today’s models’ social reasoning.

  • Limitations: 1) Short clips (5–10s) restrict complex stories like deception or long-term planning; half the paradigms are long-horizon and deferred. 2) The VLM judge, while aligned with humans, can still have biases, especially across cultures or camera styles. 3) Prompts capture common norms; regional or situational norms (e.g., queuing practices) may vary. 4) Training data may include social patterns unevenly, skewing results. 5) Evaluation is pass/fail per dimension; some near-misses are treated the same as clear misses.
  • Required resources: Access to multiple video generators, compute for batch generation, and a high-capacity VLM (e.g., Gemini 2.5 Pro) for judging; optional human validation for audits.
  • When not to use: 1) If you only need visual quality checks (use VBench-like tools). 2) If your model generates long narratives (await long-horizon SVBench extension). 3) If your domain uses non-standard or specialized norms (customize paradigms first).
  • Open questions: 1) How to extend to long, multi-scene social narratives? 2) How to fairly represent cultural diversity in norms and spacing? 3) Can we build training curricula from SVBench failures to teach missing skills? 4) How robust is judging across different VLMs? 5) Can we measure graded partial credit without reintroducing scoring noise?

🍞 Anchor: Think of SVBench like a fitness test: great for spotting which muscles (skills) are weak today, while planning a training program for marathons (long-horizon social stories) tomorrow.

06 Conclusion & Future Work

Three-sentence summary:

  • SVBench is a psychology-grounded benchmark that evaluates whether video generation models can produce socially coherent behavior, not just realistic visuals.
  • A training-free, four-agent pipeline creates neutral, difficulty-controlled prompts and a VLM judge scores five interpretable dimensions per video.
  • Results across eight models reveal that even top systems still struggle with belief reasoning, subtle cue integration, and multi-agent coordination.

Main achievement:

  • Turning classic social-cognition experiments into fair, controllable, and scalable generation tasks with crisp, binary, dimension-wise evaluation—finally measuring the ‘why’ behind actions in generated videos.

Future directions:

  • Extend to long-horizon narratives (deception, multi-stage collaboration), diversify cultural norms, compare multiple VLM judges for robustness, and convert failure analyses into training curricula that teach missing social skills.

Why remember this:

  • SVBench marks a shift from judging appearance to judging social sense, offering a clear path for building video models that understand people—not just pixels.

Practical Applications

  • Benchmark studio tools to ensure generated crowd scenes follow queuing and spacing norms before film or TV use.
  • Stress-test game NPCs for joint attention and coordination so co-op play feels natural and responsive.
  • Evaluate social robots’ interaction videos (helping, turn-taking) before real-world deployment in schools or hospitals.
  • Screen ad concepts for socially coherent interactions (e.g., sharing, comforting) to avoid awkward or insensitive portrayals.
  • Design training curricula: use failures on SVBench tasks to build fine-tuned datasets that teach missing social skills.
  • Compare model updates: run SVBench regressions to ensure social reasoning doesn’t degrade when visual quality improves.
  • Localize content: swap in culture-specific norms/spacing prompts to check regional appropriateness.
  • Prototype assistants that follow gaze and pointing to control smart-home devices or AR interfaces.
  • Audit safety: verify that generated videos avoid norm violations (e.g., cutting lines, unsafe crowding) in public-safety messaging.
  • Research multi-agent strategy by probing leader–follower coordination and perspective-based helping scenarios.
Tags: social reasoning · video generation · benchmark · joint attention · prosocial behavior · theory of mind · vision-language model judge · agent-based pipeline · difficulty control · social cues · prompt faithfulness · social norms · multi-agent coordination · text-to-video · developmental psychology paradigms