AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking
Key Summary
- •AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.
- •It doesn’t just ask, “What did you hear?”—it asks if the AI knows the feeling, joke, meaning, and cultural use behind the clip.
- •Researchers collected 1,032 iconic clips across many languages and sound types (speech, song, music, and sound effects), and wrote unique questions for each.
- •They removed easy shortcuts so models can’t guess answers just from text or obvious on-screen hints, making it a fairer test of real listening and watching.
- •Across many top models, scores were high on plain speech tasks but dropped a lot for culture, context, and textless sounds like music or sound effects.
- •Even the best models struggled more with lesser-known languages and with questions that require knowing how people actually use a meme.
- •Extra “thinking” sometimes helped for basic recognition, but didn’t consistently improve culture-and-context questions.
- •Humans still hold an edge, especially on clips they already know, showing that cultural grounding remains a big challenge for AI.
- •This benchmark shines a light on what today’s multimodal AIs miss and points the way toward models that better understand people’s feelings, habits, and cultures.
Why This Research Matters
People communicate tons of meaning with tiny sounds, melodies, and short video moments, not just words. AVMeme Exam shows clearly that today’s strongest models still miss much of the cultural and contextual heart of these clips. This matters for tools that should feel human-aware—assistants, accessibility features, content moderation, and creative aid. When AI understands usage and emotion, it can better support creators, educators, and everyday users across cultures and languages. The benchmark also protects fairness by blocking shortcuts, so progress reflects real perception and reasoning. By revealing exactly where models fail—like textless audio and lesser-resourced languages—it guides researchers toward smarter, more inclusive multimodal AI.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you show a friend a five‑second video of a door slam and a funny “fail” sound. Without any words, your friend laughs and says, “That’s the Windows error sound!” People instantly get meaning from tiny sounds, music, and quick video moments—because we share context and culture. 🥬 The Concept: Before this research, AIs were getting good at reading text and recognizing pictures, but they often stumbled when meaning lived in sound and timing—like sarcasm in a voice, suspense in music, or a one‑second sound effect that billions instantly recognize. How it worked before: 1) Many tests asked AIs to caption images, transcribe speech, or spot objects in a video. 2) Some audio tests checked if AIs could hear words or identify a sound. 3) A few meme tests mixed text and pictures, but not much with moving sound and timing. Why this mattered: Without tests that check feelings, inside jokes, and culture, we don’t know if an AI understands people—or just the surface of words. 🍞 Anchor: Think of the song clip that suddenly “Rickrolls” you. Humans know the joke because of internet history. A plain caption like “A man sings in a music video” totally misses the point.
🍞 Hook: You know how hearing just two notes can tell you what’s coming in a movie? Music and sound effects carry meaning even without words. 🥬 The Concept: Audio‑visual memes are short, iconic clips—movie lines, punchy sound effects, music riffs, and viral edits—that people reuse to communicate feelings and ideas. How they work: 1) They rely on delivery (tone, melody, pacing), 2) they carry shared cultural meaning (we’ve seen them before), 3) they are used in new situations to signal a mood or message (like playful denial with “You shall not pass!”). Why it matters: If AI can’t catch these signals, it can’t truly “get” how people communicate online. 🍞 Anchor: When a student jokes “You shall not pass!” before a tough exam, humans read humor and context; a surface‑only AI might think it’s about bridges and wizards.
🍞 Hook: Picture a teacher asking not just “What’s on the board?” but also “Why is the class laughing?” and “When would you use this joke?” 🥬 The Concept: Past benchmarks mostly checked surface understanding—what’s in the frame or what words were spoken. They didn’t test deeper layers like emotion, usage, or world knowledge. How it played out: 1) Audio benchmarks focused on recognition and captions; 2) audio‑visual ones checked events, order, and alignment; 3) language‑image meme tests probed some culture, but mostly with static images. What broke: A model could ace speech transcription but fail to understand why a sound effect signals failure, why a song feels triumphant, or how a clip is used as a reaction. 🍞 Anchor: A model might read subtitles perfectly but not realize a “dun‑dun‑duuun” sting means dramatic surprise.
🍞 Hook: Imagine a quiz bowl where some questions accidentally contain the answers in the question itself. That wouldn’t be fair! 🥬 The Concept: The world needed a benchmark that blocks easy shortcuts so the AI must truly listen and watch. How this paper fills the gap: 1) Human experts from different cultures handpicked 1,032 iconic clips; 2) they wrote unique questions about content, context, emotion, humor, usage, and world knowledge; 3) they filtered out items that models could guess using text alone or obvious on‑screen hints. Why it matters: Now we can see whether models understand the heart of what people mean—not just the words they say. 🍞 Anchor: If the screen literally shows the song title, an AI can “cheat” by reading. This benchmark hides that shortcut so the AI has to hear the tune and know its meaning.
🍞 Hook: Think of how often you use short videos and sounds to share feelings online—celebrations, fails, throwbacks, and inside jokes. 🥬 The Concept: The stakes are real. Helpful AIs should be empathetic assistants that feel in tune with people’s emotions and cultures. How it affects life: 1) Better content moderation can tell a mean meme from a playful one; 2) accessibility tools can describe not just what is heard, but how it feels; 3) global tools must respect different languages and traditions. What breaks without this: Systems remain literal, miss the vibe, and can misjudge intent. 🍞 Anchor: A creator wants AI to tag a clip’s mood as “nostalgic and funny” instead of just “man talking,” because that’s the true message viewers care about.
02 Core Idea
🍞 Hook: You know how a coach tests more than sprint speed—they also check teamwork, game sense, and clutch decisions? 🥬 The Concept: The big “aha!” is a benchmark that tests whether AIs can go beyond hearing and seeing to truly understand context, culture, and usage in short, iconic audio‑visual memes. How it works: 1) Curate famous clips (speech, songs, music, sound effects) from many cultures and languages; 2) write a unique question per clip that probes seven kinds of understanding (from surface content to context, emotion, humor, usage, and world knowledge); 3) remove text and visual shortcuts so models must genuinely listen and watch; 4) evaluate many state‑of‑the‑art models and compare with humans. Why it matters: Without such a test, we might think AI “gets it,” when it only reads words and misses the human meaning. 🍞 Anchor: If an AI can tell why Rickrolls are funny and when people use “You shall not pass!”, it’s closer to understanding people, not just pixels.
🍞 Hook: Imagine three different stories that explain the same idea. 1) Sports: A player can dribble, but do they know when to pass? 2) Cooking: You can read a recipe, but can you taste when the sauce needs salt? 3) Music: You can name notes, but can you feel the groove? 🥬 The Concept: This benchmark checks for that “game sense,” “taste,” and “groove”: the culture and context beyond simple recognition. How it works: - Sports analogy: The model should “pass” when the meme fits a certain situation (usage). - Cooking analogy: It should sense mood changes (emotion) and hidden meaning (context). - Music analogy: It must interpret textless audio like music or sound effects (audio analysis). Why it matters: Mastery means using knowledge in the right moment, not just naming parts. 🍞 Anchor: Knowing a sound is a “sting” isn’t enough; the model must know it’s used to signal a dramatic twist.
🍞 Hook: Before vs. after is like black‑and‑white vs. color TV. 🥬 The Concept: Before, models were good at speech words and visible objects; after, we want models that connect clips to how people feel and use them. Changes: 1) From “what is said” to “what is meant”; 2) from “who/what/where” to “when do people use this and why did it go viral?”; 3) from per‑frame detection to time‑varying emotion and cultural memory. Why it works: Memes are a stress test—short, loud, and loaded with meaning that lives in delivery, timing, and shared history. 🍞 Anchor: “They’re taking the hobbits to Isengard” becomes a dance‑like loop; the benchmark checks if the AI sees how the mood transforms.
🍞 Hook: Think of a jigsaw puzzle: you don’t just see one piece; you fit pieces together. 🥬 The Concept: Why it works (intuition): 1) Human‑curated clips ensure true cultural anchors; 2) seven question types map stepping stones from surface to culture; 3) removing shortcuts forces real perception; 4) slicing results by sound type and language shows exactly where models struggle (e.g., textless sounds, lesser-known languages). Why it matters: This structure turns fuzzy complaints (“AI doesn’t get culture”) into measurable, fixable gaps. 🍞 Anchor: When scores drop from language tasks to world knowledge and usage, you know the model sees pieces—but can’t yet build the whole picture.
🍞 Hook: Building blocks are like layers in a cake—each adds flavor. 🥬 The Concept: The core is built from: 1) Audio‑Visual Memes; 2) Multimodal Large Language Models (MLLMs); 3) Contextual Inference; 4) Cultural Understanding; 5) Seven Question Types; 6) Cheat blockers (text‑cheat, visual‑cheat); 7) Diverse languages and sound types. How it works: Each block tests a different skill, from hearing prosody to recognizing usage patterns. Why it matters: If any layer is missing, the “cake” collapses into mere transcription. 🍞 Anchor: A model that aces Language Analysis but fails Usage likely reads words well but doesn’t know when people post the meme.
03 Methodology
🍞 Hook: Imagine designing a school exam where every question checks a different skill—reading, listening, feeling the vibe, and using knowledge in real life. 🥬 The Concept: At a high level: Input (a short, iconic clip) → Step A (Human curation and annotation) → Step B (Cheat blocking and verification) → Step C (Seven‑type question assignment) → Step D (Controlled evaluation of models) → Output (Scores sliced by task, sound type, and language). Why it matters: This careful pipeline ensures we test real understanding, not accidental shortcuts. 🍞 Anchor: It’s like quizzing music students with blind listening tests instead of letting them read the song title off the screen.
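For readers who like to see the flow in code, here is a minimal Python sketch of how the stages could be chained. All function names and fields below are illustrative assumptions, not the benchmark's released tooling.

```python
# Hypothetical sketch of the pipeline stages described above; names and fields are illustrative.

def curate(raw_clips: list[dict]) -> list[dict]:
    """Step A (stub): keep only clips that human experts verified and annotated."""
    return [c for c in raw_clips if c.get("human_verified")]

def block_text_cheats(items: list[dict]) -> list[dict]:
    """Step B (stub): drop items that text-only models could already answer."""
    return [it for it in items if not it.get("text_cheatable")]

def assign_question_type(item: dict) -> str:
    """Step C (stub): human verifiers attach one of the seven question types."""
    return item.get("question_type", "contextual_inference")

def evaluate(items: list[dict], model_answer) -> float:
    """Step D (stub): controlled multiple-choice evaluation, returning accuracy."""
    kept = block_text_cheats(curate(items))
    hits = sum(model_answer(it) == it["answer_idx"] for it in kept)
    return hits / max(len(kept), 1)
```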
Step A: Human Curation and Annotation
- What happens: 27 audio and NLP researchers from multiple regions hand-pick 1,032 iconic internet audio-visual memes (YouTube and Bilibili), each with a 1–30 s clip, transcript (if any), summary, year, emotion, sensitivity tags, and typical usage; they also write one unique multiple-choice question per clip.
- Why this step: Auto-scraping could pull noisy, misleading, or unsafe content and miss cultural authenticity. Humans choose what they truly recognize and use.
- Example data: “Never Gonna Give You Up” (1987): usage—bait-and-switch prank; emotions—surprised, nostalgic; question—“Why did this become so popular as a meme?”
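To picture what one annotated item might carry, here is a small dataclass sketch; the field names and the example values are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MemeAnnotation:
    """One human-curated item (illustrative field names, not the release format)."""
    source_url: str                      # YouTube or Bilibili link
    clip_seconds: tuple[float, float]    # 1-30 s excerpt boundaries
    transcript: str | None               # None for music / sound effects
    summary: str
    year: int
    emotions: list[str] = field(default_factory=list)         # e.g. ["surprised", "nostalgic"]
    sensitivity_tags: list[str] = field(default_factory=list)
    typical_usage: str = ""              # e.g. "bait-and-switch prank"
    question: str = ""
    options: list[str] = field(default_factory=list)
    answer_idx: int = 0

# Example shaped after the "Never Gonna Give You Up" item described above.
rickroll = MemeAnnotation(
    source_url="https://example.com/never-gonna-give-you-up",  # placeholder URL
    clip_seconds=(0.0, 15.0),
    transcript="Never gonna give you up, never gonna let you down...",
    summary="1987 pop hit widely reused as a bait-and-switch prank.",
    year=1987,
    emotions=["surprised", "nostalgic"],
    typical_usage="bait-and-switch prank (Rickroll)",
    question="Why did this become so popular as a meme?",
    options=["option A", "option B", "option C", "option D"],  # real distractors omitted
    answer_idx=0,
)
```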
🍞 Hook: You know how a magician’s trick isn’t impressive if the secret is printed on the card? 🥬 The Concept: Text-Cheat Detection.
- What it is: A filter that flags questions models can guess correctly without audio or video.
- How it works: 1) Run strong LLMs (e.g., Gemini 2.5 Flash, Grok 4, GPT-5.1) in text-only mode; 2) if all guess the answer right, mark the item as potentially text-cheatable; 3) revise or remove these to build a harder split called meme-main (846 items) from meme-full (1,032).
- Why it matters: It prevents models from winning by recalling names or common hints instead of using their ears and eyes.
🍞 Anchor: If a question says “late-1970s space movie,” many will guess Star Wars without hearing the clip—that’s a text-cheat.
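A minimal sketch of this filter, assuming each text-only "model" is a callable that returns a predicted option index from the question and options alone (a stand-in for an actual LLM API call with no clip attached); this is not the authors' implementation.

```python
from typing import Callable

# A text-only "model" maps (question, options) to a predicted option index.
TextOnlyModel = Callable[[str, list[str]], int]

def is_text_cheatable(question: str, options: list[str], answer_idx: int,
                      text_only_models: list[TextOnlyModel]) -> bool:
    """Flag an item if every text-only model guesses it right without hearing or seeing it."""
    return all(model(question, options) == answer_idx for model in text_only_models)

def build_meme_main(items: list[dict], text_only_models: list[TextOnlyModel]) -> list[dict]:
    """Keep only items that survive the filter (a harder, meme-main-style split)."""
    return [
        it for it in items
        if not is_text_cheatable(it["question"], it["options"], it["answer_idx"], text_only_models)
    ]
```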
🍞 Hook: Think of a closed-book test where posters on the wall can’t give away answers. 🥬 The Concept: Visual-Cheat Control.
- What it is: A check for on-screen text or visuals that directly reveal the answer.
- How it works: 1) Human verifiers label visual hints as: no text, transcription, title/name, or visual contains solution; 2) if a clip is visual-cheatable, models are evaluated without its visual stream to block the shortcut.
- Why it matters: It makes sure the model listens and reasons, rather than just reading a title card.
🍞 Anchor: If the video flashes “Windows XP Error,” the vision system shouldn’t auto-win.
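The control can be pictured as a small gating rule: if verifiers marked a clip's visuals as revealing the answer, the model only receives audio. Which labels count as "cheats," plus the dictionary fields below, are assumptions for illustration.

```python
# Visual-hint labels assigned by human verifiers (names are illustrative).
CHEAT_LABELS = {"title_or_name", "visual_contains_solution"}  # assumed to count as cheats
SAFE_LABELS = {"no_text", "transcription"}

def inputs_for_model(item: dict) -> dict:
    """Decide which modalities a model may see for one item."""
    visual_cheat = item["visual_hint"] in CHEAT_LABELS
    return {
        "audio": item["audio_path"],
        # Hide the whole visual stream when on-screen text would give the answer away.
        "video": None if visual_cheat else item["video_path"],
        "question": item["question"],
        "options": item["options"],
    }
```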
🍞 Hook: Imagine seven mini-quizzes, each checking a different kind of smarts. 🥬 The Concept: Seven Question Types.
- What it is: A taxonomy covering: Audio Analysis (prosody/style), Language Analysis (words/grammar), Contextual Inference (intent/situation), Emotion Analysis (felt mood), Humor & Popularity (why it’s iconic), Usage & Application (how people use it), World Knowledge (external facts).
- How it works: Independent human verifiers assign each question to a type after quality checks.
- Why it matters: It maps the climb from surface to deep cultural understanding.
🍞 Anchor: A “You shall not pass” clip can ask: what is said (Language), what it feels like (Emotion), and when people use it online (Usage).
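The taxonomy maps naturally onto a simple enumeration; the identifier names below are illustrative, not the dataset's official labels.

```python
from enum import Enum

class QuestionType(Enum):
    """The seven question types (identifier names are illustrative)."""
    AUDIO_ANALYSIS = "audio_analysis"              # prosody, delivery, style
    LANGUAGE_ANALYSIS = "language_analysis"        # words, grammar
    CONTEXTUAL_INFERENCE = "contextual_inference"  # intent, situation
    EMOTION_ANALYSIS = "emotion_analysis"          # felt mood
    HUMOR_POPULARITY = "humor_popularity"          # why the clip is iconic
    USAGE_APPLICATION = "usage_application"        # how people deploy the meme
    WORLD_KNOWLEDGE = "world_knowledge"            # external facts
```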
🍞 Hook: Like sorting songs by genre and language to see where a band plays best. 🥬 The Concept: Multi-Axis Organization.
- What it is: Every clip is labeled by sound type (speech, song, music, sound effect) and language (English, Chinese, Japanese, Korean, Persian, and more, or none for textless audio).
- How it works: Results are reported by question type and by these labels to reveal strengths and weaknesses.
- Why it matters: It shows, for example, if models stumble on textless audio regardless of language.
🍞 Anchor: Many models scored much lower on music and sound effects than on speech.
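Slicing results along these axes amounts to grouping per-item correctness by labels; here is a minimal sketch with assumed field names.

```python
from collections import defaultdict

def accuracy_by(results: list[dict], axis: str) -> dict[str, float]:
    """Average correctness grouped by one label axis, e.g. "sound_type",
    "language", or "question_type". Each result dict is assumed to carry
    those labels plus a boolean "correct" field."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r[axis]].append(r["correct"])
    return {label: sum(flags) / len(flags) for label, flags in buckets.items()}

# e.g. accuracy_by(results, "sound_type") would make a speech-vs-music gap visible.
```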
🍞 Hook: Think of leveling the playing field so all runners start the race fairly. 🥬 The Concept: Controlled Evaluation Setup.
- What it is: A uniform protocol so models get the same inputs and prompts.
- How it works: 1) Audio is 16 kHz mono; video is 360p at 1 fps; 2) question with shuffled options; 3) no filenames or metadata leaks; 4) for visual-cheat clips, hide visuals; 5) same prompt style for all.
- Why it matters: Fairness—scores reflect skill, not special treatment.
🍞 Anchor: Two models hearing the same 10-second sound must answer from the same clues.
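A sketch of what this protocol might look like in practice, using ffmpeg for preprocessing (an assumed tool choice) and a deterministic option shuffle; the target formats follow the text: 16 kHz mono audio, 360p video at 1 fps, and prompts that carry no filenames or metadata.

```python
import random
import subprocess

AUDIO_SR = 16_000   # 16 kHz mono audio
VIDEO_HEIGHT = 360  # 360p video
VIDEO_FPS = 1       # 1 frame per second

def preprocess_clip(src: str, audio_out: str, video_out: str) -> None:
    """Normalize one clip with ffmpeg (assumed tooling, not the authors' script)."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(AUDIO_SR), audio_out],
                   check=True)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an",
                    "-vf", f"scale=-2:{VIDEO_HEIGHT},fps={VIDEO_FPS}", video_out],
                   check=True)

def build_prompt(question: str, options: list[str], seed: int) -> tuple[str, list[str]]:
    """Shuffle answer options deterministically and format a neutral prompt.
    No filenames or other metadata are included, so nothing leaks the answer."""
    rng = random.Random(seed)
    shuffled = options[:]
    rng.shuffle(shuffled)
    letters = "ABCDEFGH"[: len(shuffled)]
    lines = [question] + [f"{letter}. {opt}" for letter, opt in zip(letters, shuffled)]
    return "\n".join(lines), shuffled
```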
🍞 Hook: Secret sauce time—what makes this exam clever? 🥬 The Concept: The Secret Sauce.
- What it is: Human cultural curation plus layered anti-cheat design and fine-grained slicing of results.
- How it works: 1) Humans pick what truly matters in culture; 2) questions climb from “what” to “why and how used”; 3) cheats are blocked; 4) analyses break down by sound and language to reveal where models really fail.
- Why it matters: It turns hidden weaknesses into clear, fixable targets for future models.
🍞 Anchor: After blocking shortcuts, model accuracy drops on meme-main, proving the test now measures genuine multimodal understanding.
04 Experiments & Results
🍞 Hook: Picture a science fair where every robot hears the same sounds and watches the same clips. Now we see who really gets the joke. 🥬 The Concept: The test measures question‑answer accuracy across seven skills, split by sound type and language, for many strong models—including audio‑only and audio‑visual systems—and compares them with human participants. Why it matters: Numbers become meaningful when they show where models shine (words) and where they wobble (culture, textless audio). 🍞 Anchor: Think of the scoreboard as grades per subject: A in spelling, C in music listening, B‑ in history.
- The Test: What and Why
  - What was measured: Multiple-choice Q&A accuracy per question type (content, context, emotion, humor, usage, world knowledge), by sound type (speech, song, music, sound effects), and by language.
  - Why: To separate “hearing words” from “getting meaning,” and to see if models handle non-verbal audio and multicultural content.
  - Extra fairness: A “meme-main” split removed text-cheatable items; visual-cheat clips were evaluated without visuals.
- The Competition: Who Was Compared
  - Audio-only models like GPT-4o Audio and Music Flamingo.
  - Audio-visual models like Qwen3-Omni and the Gemini 2.5/3 families.
  - Humans: 20 participants (10 English-native, 10 Chinese-native) answered clips in their own language or with no language (e.g., music), providing a human baseline.
- The Scoreboard: Results with Context
  - Overall winners: Gemini 3 Pro led with about 76.6% (audio-only) and 80.0% (audio-visual) on the harder meme-main split—a solid A- to A performance. The best open-source model (Qwen3-Omni) scored around the mid-50s, more like a C+/B-, showing a wide gap.
  - Audio-visual beats audio-only: Seeing helps, especially for identity, scenes, and interactions.
  - Harder split works: Scores drop by roughly 5–10 points from meme-full to meme-main, proving shortcut removal made the test more honest.
  - Content vs. culture: Nearly all models did best on Language Analysis (reading and parsing words), often in the 75–90% range for the strongest models. Performance dipped for Audio Analysis (hearing style, rhythm, prosody). It dropped further for Context, Humor, Usage, and World Knowledge—especially Usage and World Knowledge, where even strong models often fell into the 20–55% range. That’s like scoring an A in spelling but a C or D in understanding the joke or where it came from.
  - Textless sounds are tough: On music and sound effects, many models hovered around the mid-30s to mid-40s, much lower than on speech and songs (often 60%+). This shows models rely heavily on words.
  - Language gaps: English and Chinese did best; Japanese, Korean, Persian, and “no language” (music/SFX) lagged. Even top models dipped into the 35–55% range for these, underscoring global coverage challenges.
  - Humans vs. models: Humans did much better on clips they knew. On unfamiliar clips, strong open-source models could compete with a single human, but humans still outperformed many models overall—showing people’s cultural grounding is a big advantage.
- Surprising Findings
  - More “thinking” doesn’t always help: Longer reasoning boosted basic recognition (Audio and Language) and certain World Knowledge (recall) for Gemini 3 Pro, but barely helped Context and Humor, and sometimes hurt Usage and Emotion. Like over-thinking a joke until it stops being funny.
  - Tiny hints can inflate scores: Saying “This is a meme” added a bit; revealing the actual meme name boosted scores by roughly 10 points—proof that recall beats real listening if allowed.
  - On-screen text is a huge shortcut: If visuals show titles or solutions, accuracy jumps a lot. Once blocked, scores drop—confirming the need for cheat control.
- Takeaway Patterns
  - Strengths: Surface language tasks; recognition aided by visuals; known languages like English and Chinese.
  - Weaknesses: Culture and pragmatics (Context, Humor, Usage), textless audio (music, SFX), and lesser-resourced languages.
  - Big picture: Models can “hear words,” but still struggle to “hear meaning.”
🍞 Anchor: It’s like a student who reads every line out loud perfectly but misses the punchline of the comic strip.
05 Discussion & Limitations
🍞 Hook: Imagine grading a bright student who can read fast but misses the story’s heart. Where do they still stumble, and how can we help? 🥬 The Concept: An honest assessment looks at limits, resources, where not to use this, and what’s still unknown.
- Limitations (what this can’t do yet): 1) Cultural coverage reflects the curators (mostly highly educated adults aged 22–35), not every community worldwide; 2) meme meanings shift over time, so labels reflect “now,” not forever; 3) 30-second clip limits might cut off context; 4) multiple-choice on single clips isn’t the same as real conversations and personalized experiences; 5) meme interpretations can be subjective—there isn’t always one true answer.
- Required resources (to use the benchmark well): 1) Models that accept audio and video; 2) preprocessing to standard formats; 3) evaluation code that shuffles answers and blocks metadata leaks; 4) safe-use filters for sensitive content.
- When NOT to use: 1) High-stakes safety decisions (e.g., legal judgments), since cultural readings can be subjective; 2) training or fine-tuning directly on this test (to avoid contamination) instead of just evaluating; 3) tasks requiring long video context beyond 30 seconds or multi-turn dialogue.
- Open questions: 1) How to teach models the “usage” of memes—like social rules, not just facts? 2) How to learn from textless audio (music, SFX) so meaning is felt, not just labeled? 3) How to generalize to minoritized languages and cultures fairly? 4) Can longer thinking be guided to help culture tasks instead of over-explaining? 5) What training signals best align models to human interpretations of humor and emotion without overfitting to trends?
Why it matters: Knowing the edges of today’s abilities helps researchers build tomorrow’s more human-savvy models. 🍞 Anchor: It’s like realizing the student needs practice with tone and subtext—so next time they’ll laugh at the right moments.
06 Conclusion & Future Work
🍞 Hook: Picture an exam that doesn’t just ask what a clip says but whether you truly get why people share it. 🥬 The Concept: In three sentences: AVMeme Exam is a human‑curated test of 1,032 audio‑visual memes that measures whether AI understands content, context, emotion, humor, usage, and world knowledge—without shortcut clues. Across many top models, performance is strong on surface language but significantly weaker on culture and textless audio, especially in lesser‑resourced languages. This reveals a real gap between current multimodal AI and human‑aligned cultural understanding. Main achievement: A rigorous, multicultural, multimodal benchmark that makes hidden weaknesses visible and measurable—turning “AI doesn’t get culture” into concrete numbers and tasks. Future directions: Expand cultural and language coverage, lengthen context windows, move beyond multiple choice to dialogue and personalization, and invent training objectives that teach models to feel timing, usage, and shared meaning—especially for music and sound effects. Why remember this: Because online life speaks through tiny sounds and short clips; an AI that truly helps people must understand not only the words, but the wink, the rhythm, and the reason we share.
Practical Applications
- •Improve recommendation systems to detect a clip’s vibe (e.g., nostalgic, ironic) and not just its words.
- •Build better accessibility tools that describe not only what’s happening but how it feels (e.g., suspenseful music).
- •Enhance content moderation by distinguishing playful teasing from harmful targeting in meme contexts.
- •Assist creators with automatic meme usage suggestions (e.g., when a sound effect fits a situation).
- •Support education platforms by explaining why a clip is culturally significant, not just when it appeared.
- •Boost customer support bots that must understand reaction clips users attach to messages.
- •Guide multilingual product design by revealing gaps in lesser-resourced languages and nonverbal audio.
- •Help music and audio analysis apps infer emotional intent and common cultural uses of sound cues.
- •Inform AI training strategies that teach models to reason over textless audio like music and SFX.
- •Standardize evaluation for multimodal products to prevent inflated scores from visual or text hints.