What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
Key Summary
- Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.
- The paper builds HAERAE-Vision: 653 tough, real Korean image+question pairs, each also rewritten into a clear, explicit version.
- Even top models scored under 50% on the original vague questions, showing a big reality gap in current evaluations.
- Simply rewriting the same question to be explicit boosts scores by 8–22 points, helping smaller models the most.
- Web search helps a bit, but it cannot fix missing details in the user’s question; explicit questions without search still beat vague ones with search.
- About a quarter of the tasks need Korean cultural knowledge, which many global models lack.
- A careful checklist and an LLM-as-judge provide fair, partial-credit scoring that matches human judgments well.
- The main message: model “failure” is often about unclear user input, not only about weak model ability.
- Better benchmarks and tools should teach models (and UIs) to surface what users leave unsaid.
- This matters for everyday help—like fixing a home project or identifying a gadget in your car—where people naturally ask short, messy questions.
Why This Research Matters
This work shows why AI can seem smart in labs but stumble in real life: people naturally leave things unsaid. By proving that clarity alone boosts scores significantly, it points product teams to add “make-it-clear” steps (rewrites or follow-up questions) before answering. It also highlights a cultural knowledge gap, reminding builders that global users need locally grounded models. For everyday users, this means fewer wrong purchases, safer guidance on tools or car features, and better step-by-step help for math, coding, and home fixes. For educators and companies, it suggests training and UI designs that guide users to add missing details. Overall, it brings AI evaluation closer to how we actually ask for help—short, messy, and image-based.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how when you text a friend, you sometimes write “How do I do this?” and send a quick photo, expecting they’ll just “get it”? That works with friends—because they know you and can see the picture.
🥬 Filling (The Actual Concept)
- What it is: Vision-language models (VLMs) are AIs that read pictures and words together to answer questions or follow instructions.
- How it works: They look at the image, read the text, connect the two, and then produce an answer.
- Why it matters: If the question is vague—like “this” or “that”—the model has to guess your intent. Without enough clues, it guesses wrong.
🍞 Bottom Bread (Anchor) Imagine sending a blurry photo of a car gadget with the message, “What’s this for?” If you don’t say it’s above the steering wheel and meant to watch your face, the AI might never realize it’s a driver monitoring camera.
🍞 Top Bread (Hook) Think of a scavenger hunt clue that says, “It’s under that thing.” You’d be stuck unless someone tells you which thing.
🥬 Filling (The Problem Before This Paper)
- What it is: Most tests for VLMs use super-clear, well-structured questions that tell the AI exactly what to do.
- How it works: Benchmarks usually avoid slang, avoid vagueness, and spell out every needed detail.
- Why it matters: In real life, people don’t write like that—they skip details and rely on the picture. So test scores looked better than what happens in the real world.
🍞 Bottom Bread (Anchor) On forums, people ask, “How do I remove this?” with a ceiling photo. A clean benchmark would say, “How do I remove the white ceiling ring-shaped hook, including the inner metal fitting?” Those are very different challenges.
🍞 Top Bread (Hook) Imagine a teacher who only gives practice problems where every step is already labeled. You’ll ace the test—but freeze when the real-life puzzle leaves steps unlabeled.
🥬 Filling (Failed Attempts)
- What people tried: Make models bigger; add web search; add reasoning tricks.
- Why it didn’t work: If the model doesn’t know what you actually want (your intent), extra size or search still won’t find the right facts. Search needs the right keywords first.
- What breaks without it: The model chases wrong leads, like Googling “sea worm” when the picture actually shows a snail species.
🍞 Bottom Bread (Anchor) Even with search on, models did worse on vague questions than on clear ones without search. Like trying to find a book in a library when you don’t know the title, author, or topic.
🍞 Top Bread (Hook) Picture a translator who’s great at both languages but doesn’t know your inside jokes or local slang. They’ll miss the meaning even if the words look right.
🥬 Filling (The Gap)
- What was missing: A benchmark built from authentic, messy, under-specified questions that people actually ask, plus a clear rewrite of those same questions to compare.
- How it works: The paper builds HAERAE-Vision from 86,052 real posts, filters sharply to 653 high-quality items (0.76%), and pairs each with an explicit rewrite.
- Why it matters: Now we can measure how much of the difficulty comes from the question itself (lack of clarity) versus from the model’s ability.
🍞 Bottom Bread (Anchor) With both versions—vague vs explicit—we can finally see: the same picture + clearer words = big score jumps.
🍞 Top Bread (Hook) Why should anyone care? Because you, your parents, and your teachers all ask for help with photos—from fixing a baseboard to identifying a marine creature.
🥬 Filling (Real Stakes)
- What it is: Everyday help needs models that understand what you mean, even when you don’t say everything.
- How it works: If tools can surface missing details (“Do you mean the metal fitting under the white hook?”), they’ll help faster and make fewer mistakes.
- What breaks without it: Wrong advice (buying the wrong product), missed safety info (car systems), or confusion (math steps that were never stated).
🍞 Bottom Bread (Anchor) If a model knows to ask, “Is this the Driver Monitoring System camera above your steering wheel that checks for drowsiness?”, you’ll get the right answer the first time.
02 Core Idea
🍞 Top Bread (Hook) Imagine playing charades: your teammate says “Do that thing!” while pointing. If they don’t say what “that thing” is, even a genius can lose.
🥬 Filling (The Aha! Moment)
- One sentence: A big chunk of VLM “difficulty” isn’t about weak models—it’s about users leaving key details unsaid; making the same question explicit unlocks big gains.
- How it works: Build a benchmark from real, under-specified questions; rewrite each one to be clear; test many models on both; compare.
- Why it matters: We finally separate “model can’t” from “question wasn’t clear,” so we can fix the right problem.
🍞 Bottom Bread (Anchor) Top models scored under 50% on the original questions but jumped 8–22 points after rewriting the same questions clearly.
🍞 Top Bread (Hook) You know how a librarian helps by asking, “Which author? Which year?” to find your book? That’s turning a vague request into a precise one.
🥬 Filling (Concept 1: HAERAE-Vision)
- What it is: A benchmark of 653 real Korean image+question pairs that are naturally messy and under-specified, each paired with a clear rewrite.
- How it works: Start with 86k real posts; filter for safety, difficulty, visual need, and quality; keep 653 (0.76%); add checklists for fair scoring.
- Why it matters: It mirrors real life, not just classroom-perfect questions, showing what models truly face.
🍞 Bottom Bread (Anchor) Questions like “What is this?” with a photo of a car interior are common; now they’re tested head-to-head against the explicit rewrite (“What does the camera above the steering wheel do?”).
🍞 Top Bread (Hook) Think of adding labels to a confusing map: suddenly, routes make sense.
🥬 Filling (Concept 2: Query Explicitation)
- What it is: Rewriting a vague question so it includes the key who/what/where details needed to answer.
- How it works: Replace “this/that” with the named object, mention location or domain, and keep the original intent and scope.
- Why it matters: Without it, models search or reason in the wrong direction; with it, they zoom in on the right target immediately.
🍞 Bottom Bread (Anchor) “Where can I buy this?” becomes “Where can I buy a wooden skirting board (목재 걸레받이) like the one in the photo?”
🍞 Top Bread (Hook) Imagine visiting another country where road tools look different—your brain might mislabel things.
🥬 Filling (Concept 3: Cultural Grounding)
- What it is: Some questions require Korean-specific knowledge (policies, brands, signs, UI layouts, everyday items).
- How it works: About 23.7% of items need this knowledge; models trained mostly on English content often miss it.
- Why it matters: Even clear questions can fail if the model lacks local knowledge.
🍞 Bottom Bread (Anchor) Orange roadside bags in Korea are winter sandbags—models without this context guessed “safety markers” or “wasp traps.”
🍞 Top Bread (Hook) If a referee uses a checklist to judge a performance, you get partial credit for what you did right.
🥬 Filling (Concept 4: Checklist-Based Scoring)
- What it is: Each question has 1–5 concrete checkpoints (criteria) the answer must cover.
- How it works: The model’s response gets points for each criterion fully or partially met, judged by an LLM with strict rules (a small scoring sketch follows below).
- Why it matters: It’s fair, allows partial credit, and reveals exactly what was missing (like naming the exact species or giving steps).
🍞 Bottom Bread (Anchor) For the Jeju sea creature: criteria might include (1) say it’s Dendropoma maxima, (2) clarify it’s a snail, not a worm.
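To make the partial-credit idea concrete, here is a minimal Python sketch. The 1 / 0.5 / 0 credit values follow the scoring described later in the evaluation section; the field names and the example statuses are hypothetical, not the paper’s code.

```python
# Minimal sketch of checklist-based partial credit (field names are hypothetical).
# Each criterion is judged as met (1.0), partially met (0.5), or not met (0.0);
# the question's score is the average over its criteria.

checklist = [
    {"criterion": "Identifies the creature as Dendropoma maxima", "status": "met"},
    {"criterion": "Clarifies it is a snail, not a worm", "status": "partial"},
]

CREDIT = {"met": 1.0, "partial": 0.5, "not_met": 0.0}

def question_score(items):
    """Average partial credit across a question's checklist criteria."""
    return sum(CREDIT[item["status"]] for item in items) / len(items)

print(question_score(checklist))  # 0.75 for this hypothetical grading
```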
🍞 Top Bread (Hook) Even the best flashlight can’t help if you don’t know what you’re looking for.
🥬 Filling (Before vs After)
- Before: We thought low scores meant weak models.
- After: We learned that unclear questions hide a lot of the difficulty; making them explicit lifts scores immediately.
- Why it works: Clarity supplies the missing “keywords” for both reasoning and search.
🍞 Bottom Bread (Anchor) Original+search still scored lower than explicit-without-search. Once the question is clear, search adds only a small extra boost.
🍞 Top Bread (Hook) A recipe card works because it says what, where, and how.
🥬 Filling (Why It Works: Intuition)
- What it is: Explicitation lowers ambiguity and raises intent clarity.
- How it works: It narrows the solution space, aligns the image region to inspect, and provides the right labels for search.
- Why it matters: Less guessing, more knowing.
🍞 Bottom Bread (Anchor) “Is this asbestos?” vs “Is this gypsum board (석고텍스) in the photo asbestos?” The second guides the model to the exact safety fact.
03 Methodology
🍞 Top Bread (Hook) Think of this like turning a messy backpack into neat folders so you can find any paper fast.
🥬 Filling (High-Level Recipe)
- What it is: A pipeline that builds the dataset and a scoring system that measures answers fairly.
- How it works (Input → Steps → Output):
- Input: 86,052 real Korean (question, image, answer) posts.
- Steps: Safety/objectivity filtering → Difficulty calibration → Image-dependency check → Checklist creation → Human validation → Pair with explicit rewrites (a code skeleton of this chain follows below).
- Output: 653 under-specified items + 653 explicit rewrites, each with a checklist and a trustworthy scoring process.
- Why it matters: It keeps real-life messiness but ensures every item is answerable, visual, and fairly graded.
🍞 Bottom Bread (Anchor) Like cleaning a big pile of mixed homework into a final binder of the most useful, graded problems.
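As a mental model of the whole recipe, here is an illustrative Python skeleton. Every callable it takes is a hypothetical placeholder; only the order of the stages follows the pipeline described above.

```python
# Illustrative skeleton of the curation pipeline. The stage order mirrors the
# steps above; the callables are hypothetical placeholders supplied by the
# caller, not the paper's actual implementation.

def build_benchmark(raw_posts, is_appropriate, is_hard, needs_image,
                    attach_checklist, human_validate, make_explicit):
    items = [p for p in raw_posts if is_appropriate(p)]   # safety / objectivity / timelessness
    items = [p for p in items if is_hard(p)]              # difficulty calibration
    items = [p for p in items if needs_image(p)]          # image-dependency check
    items = [attach_checklist(p) for p in items]          # add 1-5 grading criteria
    items = human_validate(items)                         # annotator filtering and refinement
    return [(p, make_explicit(p)) for p in items]         # (original, explicit) pairs
```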
🍞 Top Bread (Hook) Imagine cleaning the pool before you swim.
🥬 Filling (Step 1: Appropriateness Filtering)
- What it is: Remove unsafe, too subjective, or time-sensitive items.
- How it works: An LLM flags risky content (e.g., adult or hateful material), unverifiable opinions, or “now/today” questions; humans spot-check later (a prompt sketch follows below).
- Why it matters: Keeps the benchmark safe, fair, and timeless.
🍞 Bottom Bread (Anchor) “Is this store open now?” gets filtered out because “now” changes.
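A plausible way to implement this screening is a single classification prompt, as in the sketch below. The prompt wording, the label names, and the call_llm helper are assumptions; only the three filtering criteria come from the description above.

```python
# Hypothetical screening prompt for the appropriateness filter; `call_llm` stands
# in for whatever chat-completion client is available.

SCREEN_PROMPT = """You will see a user question and its accepted answer.
Label the item with exactly one of: KEEP, UNSAFE, SUBJECTIVE, TIME_SENSITIVE.
- UNSAFE: adult, hateful, or otherwise risky content.
- SUBJECTIVE: asks for opinions that cannot be verified.
- TIME_SENSITIVE: the answer changes with "now"/"today" (e.g., store hours).
Question: {question}
Answer: {answer}
Label:"""

def is_appropriate(item, call_llm):
    label = call_llm(SCREEN_PROMPT.format(question=item["question"], answer=item["human_answer"]))
    return label.strip().upper() == "KEEP"
```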
🍞 Top Bread (Hook) If a puzzle is too easy, it won’t teach you much.
🥬 Filling (Step 2: Difficulty Calibration)
- What it is: Remove items that strong models already solve trivially.
- How it works: Test three strong models and drop items where their answers overlap the human answer too closely (see the sketch below).
- Why it matters: Ensures remaining items are challenging and informative.
🍞 Bottom Bread (Anchor) If GPT-4o, Gemini 1.5 Flash, and Claude 3.5 all ace a question, it’s not useful for this benchmark.
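A minimal sketch of this calibration step, assuming a text-similarity measure, a cutoff, and an “any one strong model” rule (all three are assumptions; the text only says high-overlap items are dropped):

```python
# Sketch of difficulty calibration: drop items that strong models already solve.
# The similarity function, the 0.8 threshold, and the "any one model" rule are
# hypothetical; only the overlap-with-human-answer idea follows the text.

def is_hard(item, reference_models, similarity, threshold=0.8):
    for answer_fn in reference_models:
        answer = answer_fn(item["question"], item["image"])
        if similarity(answer, item["human_answer"]) >= threshold:
            return False  # a strong model already matches the human answer closely
    return True
```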
🍞 Top Bread (Hook) A riddle about a picture should actually need the picture.
🥬 Filling (Step 3: Image Dependency Verification)
- What it is: Check that the image is truly necessary to answer.
- How it works: Generate answers with and without the image; if quality is similar, discard the item (sketched below).
- Why it matters: Keeps only genuinely visual questions.
🍞 Bottom Bread (Anchor) If you can answer just from the text alone, it’s not the right kind of visual question here.
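A small sketch of the with/without-image comparison; the quality scorer and the margin are hypothetical stand-ins for whatever comparison the authors actually used.

```python
# Sketch of the image-dependency check: answer with and without the image and keep
# the item only if the image clearly helps. `quality` and `margin` are hypothetical.

def needs_image(item, answer_fn, quality, margin=0.1):
    with_image = answer_fn(item["question"], image=item["image"])
    text_only = answer_fn(item["question"], image=None)
    gain = quality(with_image, item["human_answer"]) - quality(text_only, item["human_answer"])
    return gain > margin
```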
🍞 Top Bread (Hook) Judges in a contest use a clear rubric so everyone knows what counts.
🥬 Filling (Step 4: Checklist Generation)
- What it is: Turn the accepted human answer into 1–5 precise criteria.
- How it works: An LLM drafts the checklist, humans refine it, and the criteria focus on correctness and reasoning steps (a drafting-prompt sketch follows below).
- Why it matters: Allows partial credit and consistent, reproducible scoring.
🍞 Bottom Bread (Anchor) For a coding error: (1) explain red underline cause (non-ASCII path), (2) suggest renaming path in English or resetting IDE cache.
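One way such drafting could look is the hypothetical prompt below; the wording and the attach_checklist helper are assumptions, while the 1–5 criteria target and the correctness/reasoning focus follow the description above.

```python
# Hypothetical drafting prompt for checklist generation; human annotators then
# refine the draft in the validation step.

CHECKLIST_PROMPT = """Given the question and its accepted human answer, write 1-5
concrete, independently checkable criteria a correct response must satisfy.
Focus on factual correctness and the necessary reasoning steps.
Question: {question}
Accepted answer: {answer}
Criteria (one per line):"""

def attach_checklist(item, call_llm):
    draft = call_llm(CHECKLIST_PROMPT.format(question=item["question"], answer=item["human_answer"]))
    item["checklist"] = [line.strip("- ").strip() for line in draft.splitlines() if line.strip()]
    return item
```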
🍞 Top Bread (Hook) Quality control is better with more than one pair of eyes.
🥬 Filling (Step 5: Human Validation)
- What it is: Native Korean annotators clean and confirm everything.
- How it works: Three phases—conservative filtering, refinement (rewriting unclear checklists/questions), and final audit.
- Why it matters: Removes bad items and ensures clarity and cultural accuracy.
🍞 Bottom Bread (Anchor) If a checklist says “mention ‘달팽이 (snail)’ not ‘지렁이 (worm)’,” humans verify that’s truly what the image shows.
🍞 Top Bread (Hook) Rewrite a fuzzy map label into a sharp one.
🥬 Filling (Step 6: Query Explicitation)
- What it is: Create a clear version of each original question without changing its intent.
- How it works: Replace “this/that/here” with object names, include domain/context (e.g., game title), and verify proper nouns with web search (a rewriting-prompt sketch follows below).
- Why it matters: Lets us isolate the effect of clarity itself.
🍞 Bottom Bread (Anchor) “Are there more of these dragons?” becomes “In Genshin Impact, beyond the three baby dragons in Parkatin’s quest, are there additional ones to find?”
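A minimal sketch of what such a rewriting prompt could look like; the wording and the image_facts field are assumptions, while the three rules mirror the bullet above (the proper-noun web check is left out of the sketch).

```python
# Hypothetical rewriting prompt for query explicitation. The constraints mirror
# the rules above (name the object, add domain/location, keep intent and scope).

EXPLICIT_PROMPT = """Rewrite the user's question so it can be answered without guessing.
Rules:
- Replace vague references ("this", "that", "here") with the named object.
- Add the relevant location, domain, or title (e.g., the game's name) shown in the image.
- Do NOT change the user's intent or the scope of the question.
Original question: {question}
Key facts identified from the image: {image_facts}
Explicit rewrite:"""

def make_explicit(item, call_llm):
    prompt = EXPLICIT_PROMPT.format(question=item["question"], image_facts=item["image_facts"])
    return call_llm(prompt).strip()
```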
🍞 Top Bread (Hook) A fair referee follows the same rules every time.
🥬 Filling (Concept: LLM-as-Judge)
- What it is: A compact judge model (GPT-5-Mini) scores each checklist item as met/partly/not met, based only on the evaluated model’s answer.
- How it works: It must quote evidence from the answer and justify each score; the final score averages all checklist items (see the sketch below).
- Why it matters: Enforces explicitness and reduces bias; cross-checked with other LLM judges and humans for reliability.
🍞 Bottom Bread (Anchor) Judge reports might say “met 3.5/5,” including a short quote from the answer that proves a criterion was satisfied.
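A compact sketch of this judging loop, assuming a JSON reply format and a call_judge helper (both hypothetical); the evidence-quoting requirement and the 1 / 0.5 / 0 scale follow the description above.

```python
import json

# Sketch of the judging step: one verdict per checklist criterion, each backed by
# a quote from the answer, mapped to 1 / 0.5 / 0 and averaged. The prompt wording
# and JSON format are hypothetical.

JUDGE_PROMPT = """You are grading a model's answer against a single criterion.
Criterion: {criterion}
Model answer: {answer}
Reply in JSON: {{"verdict": "met" | "partial" | "not_met",
 "evidence": "an exact quote from the answer",
 "reason": "one sentence"}}"""

CREDIT = {"met": 1.0, "partial": 0.5, "not_met": 0.0}

def judge_question(answer, checklist, call_judge):
    scores = []
    for criterion in checklist:
        raw = call_judge(JUDGE_PROMPT.format(criterion=criterion, answer=answer))
        verdict = json.loads(raw)
        scores.append(CREDIT[verdict["verdict"]])
    return sum(scores) / len(scores)  # e.g., 3.5 criteria met out of 5 -> 0.7
```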
🍞 Top Bread (Hook) Two similar roads, two different destinations—measure both.
🥬 Filling (Evaluation Setup)
- What it is: Test 45 VLMs on both original and explicit questions; sometimes with web search.
- How it works: Unified decoding settings (temperature, top_p, max tokens) and three runs per item; compare scores across conditions (a configuration sketch follows below).
- Why it matters: Apples-to-apples comparisons across sizes, families, and conditions.
🍞 Bottom Bread (Anchor) We can say things like: “Explicit questions improved GPT-5-Nano by +21.7 points.” That’s a concrete, fair comparison.
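A sketch of what a unified harness could look like; the decoding values are illustrative placeholders, while the “same settings, three runs per item, compare conditions” structure follows the setup above.

```python
# Sketch of a unified evaluation harness. The decoding values are illustrative,
# not the paper's exact settings; only the "same settings, three runs per item"
# structure follows the description above.

EVAL_CONFIG = {
    "temperature": 0.0,   # illustrative
    "top_p": 1.0,         # illustrative
    "max_tokens": 1024,   # illustrative
    "num_runs": 3,        # three runs per item, as described
}

def evaluate(model_fn, items, judge_fn, config=EVAL_CONFIG):
    decode = {k: config[k] for k in ("temperature", "top_p", "max_tokens")}
    run_means = []
    for _ in range(config["num_runs"]):
        scores = [judge_fn(model_fn(item, **decode), item) for item in items]
        run_means.append(sum(scores) / len(scores))
    return sum(run_means) / len(run_means)  # mean score over the three runs
```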
04 Experiments & Results
🍞 Top Bread (Hook) Think of a league table where teams play the same game by the same rules—now you can really compare them.
🥬 Filling (The Test)
- What it is: Measure how well 45 VLMs answer real, under-specified questions vs the explicit rewrites, using checklists.
- How it works: Score per checklist item (1/0.5/0), average across items and questions; run three times for stability.
- Why it matters: Shows exactly how much clarity alone helps, separate from raw model power.
🍞 Bottom Bread (Anchor) Top models (Gemini 2.5 Pro, GPT-5) scored under 50% on original questions. With explicit rewrites, they crossed ~55%.
🍞 Top Bread (Hook) If the best students only get half the riddles right, the riddles are tough—or the instructions are unclear.
🥬 Filling (The Competition)
- What it is: Proprietary models (GPT-5 family, Gemini 2.5 family, Sonar-Pro, Grok-4, etc.) vs open-source models (Qwen, InternVL, Gemma, Mistral/Pixtral, Ovis2, etc.) and Korean-specialized models.
- How it works: Same dataset, same scoring. Some runs enable web search.
- Why it matters: Reveals gaps across families, scales, and features like search.
🍞 Bottom Bread (Anchor) Best proprietary models neared 48–49% on originals; strongest open models hovered around half that. Korean-specialized models lagged more on this benchmark.
🍞 Top Bread (Hook) Sharpening the question is like turning on a light switch—suddenly the room looks different.
🥬 Filling (The Scoreboard)
- Headline: Explicit rewrites boost scores by about 8–22 points. Smaller models benefit the most.
- Examples:
  - GPT-5: 48.0% → 57.6% (+9.6)
  - Gemini 2.5 Pro: 48.5% → 56.7% (+8.1)
  - GPT-5-Nano: 21.2% → 43.0% (+21.7)
- Why it matters: Under-specification is a huge hidden tax on performance, especially for small models.
🍞 Bottom Bread (Anchor) It’s like giving everyone the same math problem but rewriting the directions more clearly—the biggest improvements show up in the students who struggled most.
🍞 Top Bread (Hook) Searching the web won’t help if you don’t know what to search for.
🥬 Filling (Surprising Finding: Search Can’t Replace Clarity)
- What it is: Original+search underperformed explicit-without-search.
- Numbers (example): For GPT-5, Original+Search ≈ 55.6% vs Explicit ≈ 57.6%.
- Why it matters: Retrieval needs precise keywords; vague questions don’t provide them. Best scores happen when you combine both clarity and search (≈ 59.7%), but clarity does the heavy lifting.
🍞 Bottom Bread (Anchor) “Where can I buy this?” with search still trails “Where can I buy a wooden skirting board like this?” without search.
🍞 Top Bread (Hook) Different subjects react differently when you add clarity—like tutoring that helps some classes a lot and others a little.
🥬 Filling (By Category)
- Biggest gains: Math, Science, Coding, Shopping—where missing steps or names caused confusion.
- Still hard: Natural Objects and Entertainment/Arts—errors shifted to visual grounding and cultural knowledge.
- Cultural share: 23.7% of items needed Korean-specific knowledge.
🍞 Bottom Bread (Anchor) Models misidentified a Korean flip phone model and confused Korean winter sandbags with unrelated items—clear signs of cultural gaps.
🍞 Top Bread (Hook) Are the judges fair? Let’s check if they agree.
🥬 Filling (Judge Reliability)
- What it is: Multiple LLM judges (GPT-5, GPT-5-Mini, Gemini 2.5 Pro/Flash) and humans cross-checked scores.
- How it works: 250-sample re-evaluation; high correlations (Pearson ~0.86–0.90), Krippendorff’s α ≈ 0.867; human alignment also strong.
- Why it matters: The checklist+LLM-judge method gives a stable, human-aligned signal.
🍞 Bottom Bread (Anchor) When different referees whistle the same fouls, players trust the game.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even a great map has spots marked “Here be dragons.”
🥬 Filling (Honest Assessment)
- Limitations:
  - Language scope: Built for Korean; other languages need their own culturally grounded sets.
  - Aggressive filtering: Only 0.76% survived—great quality, but some edge cases might be lost.
  - Search scope: Only tested OpenAI’s web search; other retrieval systems might shift details but won’t fix missing intent.
  - LLM judge bias: Strong agreement, yet still not perfect—some keyword-matching or leniency remains.
- Required resources:
  - Access to images and Korean text, LLMs for rewriting and judging, and human annotators for validation.
  - Compute to run many models across 1,306 query variants.
- When not to use:
  - If your task is purely text-only, time-sensitive (e.g., “open now?”), or requires personal advice without objective checklists.
  - If you need medical/PII-heavy content beyond what the dev subset safely allows.
- Open questions:
  - How to teach models to ask the right clarifying question before answering?
  - How to build richer cultural grounding across regions and languages?
  - Can multimodal search (using image cues, not just text) close the gap further?
  - What training recipes make models robust to vague inputs without hallucinating?
🍞 Bottom Bread (Anchor) Future systems might first ask, “Do you mean the metal fitting behind the ceiling hook?” before they answer or search, leading to safer and better help.
06 Conclusion & Future Work
🍞 Top Bread (Hook) It’s easier to hit the target when someone turns on the lights and tells you which target to aim at.
🥬 Filling (Takeaway)
- 3-sentence summary: The paper builds HAERAE-Vision, a benchmark of 653 real, under-specified Korean visual questions, each paired with a clear rewrite. Testing 45 VLMs shows that unclear questions drag scores way down—even for top models—while explicit rewrites raise scores by 8–22 points. Web search helps a bit, but it cannot replace missing intent; cultural knowledge remains a key obstacle.
- Main achievement: Proving that a large share of VLM failure comes from what users leave unsaid, and giving a controlled, paired benchmark to measure it.
- Future directions: Teach models to ask clarifying questions; expand to other languages and cultures; deepen multimodal search; train with explicit/implicit pairs.
- Why remember this: When AI seems weak, sometimes the question is the problem—clarity and culture matter as much as model size.
🍞 Bottom Bread (Anchor) Next time you ask an AI “How do I remove this?”, adding a few missing words might be the difference between a wrong guess and exactly the help you needed.
Practical Applications
- Add a clarifying-question step in chat UIs when a user writes “this/that/here” with a photo.
- Offer one-click “Make my question clearer” buttons that auto-rewrite user prompts into explicit versions before answering.
- Train models on paired vague–explicit question sets to learn how to infer or request missing details.
- Build culturally grounded modules (Korea, Japan, India, etc.) to improve local object recognition and UI interpretation.
- Use checklist-style evaluation in production QA to catch missing steps (e.g., naming the exact part and providing the fix).
- Enhance retrieval to accept visual cues (multimodal search) so it doesn’t rely only on text keywords.
- Detect under-specified prompts automatically and warn: “I need more details about the object or location” (a small detector sketch follows after this list).
- Create domain-specific templates (car interiors, home hardware, gaming UIs) that map common vague phrases to precise terms.
- Prioritize dataset curation that includes messy, forum-style questions rather than only clean, textbook prompts.
- Provide user education tips like, “Name the item and where it appears in the photo for faster, better answers.”
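As a small illustration of the “detect under-specified prompts” idea above, here is a hypothetical heuristic: it flags short photo questions built around deictic words and returns a clarifying follow-up. The phrase list, the length cutoff, and the wording are all assumptions, and a real Korean-language deployment would need Korean deictics as well.

```python
import re
from typing import Optional

# Hypothetical heuristic for flagging under-specified photo questions: short
# prompts built around deictic words ("this", "that", "here") with no named
# object are a strong hint that details are missing. A Korean version would also
# need Korean deictics such as 이거 / 저거 / 여기.

DEICTIC = re.compile(r"\b(this|that|these|those|here|it)\b", re.IGNORECASE)

def clarifying_prompt(user_text: str, has_image: bool) -> Optional[str]:
    """Return a follow-up question if the prompt looks under-specified, else None."""
    if has_image and DEICTIC.search(user_text) and len(user_text.split()) < 12:
        return ("Before I answer: can you name the object and say where it "
                "appears in the photo?")
    return None

print(clarifying_prompt("How do I remove this?", has_image=True))
```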