How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Intermediate
Huanyu Zhang, Xuehai Bai, Chengzu Li et al. · 2/2/2026
arXiv · PDF

Key Summary

  • VIBE is a new test that checks how well image-editing AI models follow visual instructions like arrows, boxes, and sketches—not just text.
  • It organizes tasks into three levels: pointing to places (Deictic), reshaping things (Morphological), and causing changes like light or wind (Causal).
  • The benchmark has 1,034 carefully checked examples across 10 tasks, from adding/removing objects to predicting billiard ball paths.
  • A large multimodal model (LMM) acts like a fair judge, using task-specific checklists to score whether edits followed the instructions, kept the background safe, and looked clean.
  • Proprietary models (like Nano Banana Pro) generally beat open-source ones, especially on simple pointing tasks, but all models struggle with causal reasoning.
  • Performance drops as tasks get harder: good at Deictic, okay at Morphological, and weak at Causal (often under 50%).
  • Combining multiple visual instructions in one request makes models stumble, showing trouble with compositional understanding.
  • Visual and text instructions work best together: visuals ground where to edit, while text clarifies what to change.
  • The benchmark’s scores from the LMM judge strongly agree with human expert ratings, making the evaluation reliable.
  • VIBE highlights clear research gaps and gives a roadmap for building better visual instruction-following editors.

Why This Research Matters

VIBE makes editing AIs easier and safer to use by letting people show exactly what they want with simple marks, just like we do in real life. It reduces misunderstandings, so you get the right change in the right place without damaging the rest of the image. Designers, photographers, and everyday users can work faster because they can point and sketch instead of writing long, tricky prompts. Companies can compare models fairly and pick the one that best follows human-style instructions. Researchers get clear clues about which skills to improve—like causal reasoning or style-preserving reshaping—so progress speeds up. As models learn to reliably follow arrows and sketches, creative tools become more intuitive and accessible to everyone.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine trying to tell a friend where to stick a sticker on a busy poster using only words. “A little left… no, above the cat… not that cat, the sleeping one!” It gets frustrating fast. Now imagine you can just draw a red box around the spot. Instantly easier, right?

🥬 The World Before: For years, most image-editing AIs learned to edit pictures by reading text-only prompts. People wrote long, precise sentences to say “add a flower here” or “make the light come from the left.” This worked okay for simple changes but felt unnatural, because humans don’t only talk—we also point, sketch, and draw arrows when we want to be super clear.

🥬 The Problem: Text is clumsy for spatial intent. Saying “the second window from the right, top row” is vague and mentally tiring. Models also have to translate those words into exact pixel locations and shapes, which is hard and error-prone. That double-sided cognitive load (hard for users, hard for models) causes mistakes: edits happen in the wrong place, the wrong thing gets changed, or the background gets damaged.

🥬 Failed Attempts: Earlier benchmarks mostly checked text-guided editing. They measured if the AI could add/remove/replace objects, but rarely tested if models could follow visual cues like boxes, arrows, or rough sketches. Some newer tests tried to score results with automated judges, but they still focused on text-only instructions and didn’t separate skills like “did you edit the right spot?” from “does it look seamless?”

🥬 The Gap: What was missing was a systematic way to evaluate visual instruction-following—the natural, human way of guiding edits with marks on the image. We needed a ladder of tasks that starts with pointing to places, moves to reshaping bodies and objects, and then climbs to cause-and-effect changes like lighting and wind. And we needed a reliable judge to score these without a single perfect ground truth image.

🍞 Anchor: Think of a kid’s art class. It’s way easier to say “color inside this red box” than to describe the box with just words. VIBE turns those red boxes, arrows, and sketches into a fair test for editing AIs, checking how well they listen when we show instead of only tell.

— New Concept — 🍞 Hook: You know how in a treasure map, X marks the spot so you don’t guess?

🥬 Visual Instruction: A visual instruction is a drawing on or next to the image—like a box, arrow, or sketch—that tells the model exactly where and how to edit.

  • How it works: (1) You supply the image, a short text, and a marked-up instruction image; (2) the model reads the marks to find the target spot or shape; (3) it applies the edit; (4) it tries to keep the rest unchanged.
  • Why it matters: Without visual instructions, models must guess positions and shapes from words alone, leading to wrong locations or messy edits.

🍞 Anchor: Drawing a red box over a pillow and saying “replace with blue blanket” is clearer and faster than typing a paragraph about which pillow you mean.

02 Core Idea

🍞 Hook: Picture a three-step video game: Level 1 is pointing, Level 2 is reshaping, and Level 3 is predicting what happens next. Each level makes your brain work a bit harder.

🥬 The Aha: VIBE’s key insight is to test visual instruction-following along a simple but powerful ladder—pointing (Deictic), reshaping (Morphological), and cause-and-effect (Causal)—so we can see exactly where models succeed and where they fall apart.

  • Multiple Analogies:

    1. Navigation: Deictic = “Go to this exact address.” Morphological = “Rebuild the house to this blueprint.” Causal = “If a storm hits from the east, where will the roof tiles fly?”
    2. Art class: Deictic = “Color inside this box.” Morphological = “Redraw this character in this pose.” Causal = “Shine a lamp from here; where do shadows fall?”
    3. Sports: Deictic = “Stand here.” Morphological = “Change your stance to match this diagram.” Causal = “Hit the ball this way—predict its path and rebounds.”
  • Before vs After: Before VIBE, we judged mostly text-following and simple edits. After VIBE, we can tell whether a model can: (1) correctly find the right spot; (2) reshape without breaking identity or style; and (3) reason about physics-like changes from arrows and vectors.

  • Why It Works (intuition): Visual marks remove guesswork about “where,” sketches guide “what shape,” and arrows encode “what will happen.” Separating these lets us diagnose weaknesses precisely: a model might nail pointing but fail at reshaping joints, or cope with pose changes but break when asked to relight a scene.

  • Building Blocks: VIBE formalizes inputs as (Image, Text, Visual Instruction) → Edited Image and scores results with clear, binary-friendly checks tied to each level.
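
To make that formalization concrete, here is a minimal sketch of how one benchmark item and the model-under-test interface could be typed. The class and field names are illustrative assumptions, not identifiers from the released benchmark.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class VIBESample:
    """One benchmark item: source image, short text, and marked-up visual instruction.

    Field names are hypothetical; the released dataset may organize them differently.
    """
    image_path: str                # I_in: the original image
    text_instruction: str          # T: short text, e.g. "replace the boxed area with a clock"
    visual_instruction_path: str   # V: overlay or guide image carrying boxes, arrows, sketches
    level: str                     # "deictic", "morphological", or "causal"
    task: str                      # e.g. "removal", "pose_control", "billiards"


class EditingModel(Protocol):
    """Model under test: (Image, Text, Visual Instruction) -> Edited Image."""

    def edit(self, sample: VIBESample) -> str:
        """Return the file path of the edited image."""
        ...
```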

— New Concept — 🍞 Hook: You know when you point your finger and say, “Put it HERE”?

🥬 Deictic Grounding: Deictic grounding means using marks like boxes or arrows to select the exact place or object to edit.

  • How it works: (1) Draw a box/arrow; (2) the model locks onto that location; (3) it adds, removes, replaces, or moves content only there; (4) it keeps everything else untouched.
  • Why it matters: Without it, models change the wrong thing or mess up the background.

🍞 Anchor: A red box around a chair plus “remove” should erase the chair and leave the room intact.

— New Concept — 🍞 Hook: Imagine reshaping clay to match a wireframe model.

🥬 Morphological Manipulation: This is reshaping or re-posing objects to match a structural sketch or reference, while keeping identity and style.

  • How it works: (1) Provide a pose skeleton or sketch; (2) the model maps lines to real shapes; (3) it adjusts limbs/orientation; (4) it preserves who/what it is and how the scene looks.
  • Why it matters: Without it, characters lose identity or look glued-on.

🍞 Anchor: “Make this dancer match the pose in the stick-figure image,” but still look like the same dancer.

— New Concept — 🍞 Hook: Blow on a dandelion and watch the seeds fly—that’s cause and effect.

🥬 Causal Reasoning: Causal reasoning means predicting visual consequences from arrows or force cues (like lighting direction or wind).

  • How it works: (1) Read an arrow; (2) infer the physical effect (shadows shift, hair blows); (3) apply globally consistent changes; (4) keep identities and layout.
  • Why it matters: Without it, models make random changes that ignore physics.

🍞 Anchor: An arrow from the left should produce left-lit faces and right-cast shadows; a billiards arrow should yield the correct bounce path.

03 Methodology

🍞 Hook: Think of a recipe: you have ingredients (image + text + marks), you follow steps (find target, apply edit, check quality), and you taste-test the result.

🥬 Overview: At a high level: Input (image + text + visual instruction) → Determine target and intent → Apply edit per task type → Evaluate with an LMM judge using task-specific metrics → Final score.

Steps in Detail:

  1. Gather Inputs
  • What happens: The system takes (I_in, T, V): the original image, a short text instruction, and the visual instruction (overlaid marks or a separate guide image).
  • Why it exists: Text says “what,” visuals say “where/how.” Without both, ambiguity explodes.
  • Example: “Replace the boxed area with a clock.” The box pinpoints the spot; the text names the object.
  2. Choose the Task Type (by benchmark split)
  • What happens: VIBE organizes 10 tasks into 3 levels: Deictic (Addition, Removal, Replacement, Translation), Morphological (Pose Control, Reorientation, Draft Instantiation), and Causal (Light Control, Flow Simulation, Billiards).
  • Why it exists: Different skills need different tests. Without separation, we can’t tell what skill failed.
  • Example: An arrow plus “move this bear here” goes to Translation; a stick-figure pose goes to Pose Control.
  3. Generate Edited Image (by the model under test)
  • What happens: The model transforms pixels according to V and T: it inserts, deletes, reshapes, relights, or simulates motion.
  • Why it exists: This is the heart—turn instructions into pixels. Without it, no edit occurs.
  • Example: For Removal, the model inpaints the background where the box was.
  4. Evaluate with LMM-as-a-Judge
  • What happens: A strong multimodal model (GPT-5.1) inspects the input, the marked-up instruction, and the output, then scores task-specific criteria.
  • Why it exists: There’s no single “correct” edited image. The judge standardizes scoring at scale.
  • Example: It checks if the edit stayed inside the box and if the style is seamless.

— Secret Sauce —

  • Task-specific, mostly binary checks reduce ambiguity; geometric means reward edits that are simultaneously correct, contained, and clean. Repeating evaluation three times and averaging reduces randomness.
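
As a rough illustration of that recipe, the sketch below collects binary checklist answers from a judge model, averages them into per-metric scores, and repeats the whole evaluation three times to smooth out randomness. The `ask_judge` helper and the checklist wording are assumptions; the paper's actual judge is GPT-5.1 behind prompts that are not reproduced here.

```python
import statistics


def ask_judge(inputs: dict, output_path: str, question: str) -> int:
    """Hypothetical helper: show the input image, visual instruction, and edited output
    to an LMM judge with one yes/no question and parse the answer as 1 or 0."""
    raise NotImplementedError("placeholder for an actual LMM call, e.g. to GPT-5.1")


# Illustrative checklist for a Deictic removal task; VIBE's real criteria are task-specific.
CHECKLIST = {
    "IA": ["Was the object inside the marked region removed?",
           "Does the performed operation match the text instruction?"],
    "CP": ["Is everything outside the marked region unchanged?"],
    "VC": ["Is the edited region free of seams and artifacts?"],
}


def score_once(inputs: dict, output_path: str) -> dict:
    """Average the binary answers within each metric (IA, CP, VC)."""
    return {metric: statistics.mean(ask_judge(inputs, output_path, q) for q in questions)
            for metric, questions in CHECKLIST.items()}


def score_stable(inputs: dict, output_path: str, repeats: int = 3) -> dict:
    """Repeat the evaluation and average per metric to reduce judge randomness."""
    runs = [score_once(inputs, output_path) for _ in range(repeats)]
    return {metric: statistics.mean(run[metric] for run in runs) for metric in CHECKLIST}
```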

— New Concept — 🍞 Hook: You know how a fair referee watches the game and marks simple yes/no fouls?

🥬 LMM-as-a-Judge: An LMM-as-a-Judge is a big vision-language model that grades edits using clear checklists.

  • How it works: (1) See input + marks + output; (2) answer binary sub-questions (e.g., edited the right spot?); (3) combine sub-scores into a final score; (4) repeat to stabilize.
  • Why it matters: Without a fair, scalable judge, comparing models is slow and inconsistent.

🍞 Anchor: The judge says “yes” to editing the boxed chair only and “no” if a nearby table also got warped.

— Deictic Level Metrics —

— New Concept — 🍞 Hook: Like checking if a Lego build matches the plan, in the right place, doing the right action.

🥬 Instruction Adherence (IA): IA measures if the model followed the visual and text instructions: correct location, correct type of edit, and correct action.

  • How it works: (1) Did it edit the marked target? (2) Was the operation type right (add/remove/replace/move)? (3) Did it match the text intent? Average these.
  • Why it matters: If IA fails, the main instruction wasn’t followed.

🍞 Anchor: Box says “add a red pillow.” IA checks that a red pillow appears inside the box—not elsewhere, not blue, not a delete.

— New Concept — 🍞 Hook: Imagine painting a tiny spot without spilling paint anywhere else.

🥬 Contextual Preservation (CP): CP checks that non-target regions stayed untouched.

  • How it works: Compare input vs output; score 1 if nothing outside the target changed in a meaningful way.
  • Why it matters: Without CP, background damage ruins trust.

🍞 Anchor: Replacing a poster shouldn’t move the sofa or bend the wall.

— New Concept — 🍞 Hook: Like patching a hole so smoothly no one can see the seam.

🥬 Visual Coherence (VC): VC measures if the edited part fits the image’s style, blends smoothly, and avoids artifacts.

  • How it works: (1) Style matches domain; (2) no hard seams; (3) no glitches. Average these.
  • Why it matters: A correct edit that looks pasted-on still feels wrong.

🍞 Anchor: A watercolor scene edited with matching watercolor strokes, not photoreal parts.

Scoring Formula (Deictic and Draft Instantiation): Final Score = geometric mean of IA, CP, VC. If IA = 0, then VC = 0 (because a wrong instruction shouldn’t be rewarded for looking pretty).
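
A minimal sketch of this rule, assuming IA, CP, and VC have already been averaged into values between 0 and 1 (the function name is mine, not the paper's):

```python
def final_score(ia: float, cp: float, vc: float) -> float:
    """Geometric mean of Instruction Adherence, Contextual Preservation, and Visual
    Coherence; visual polish is zeroed out when the instruction was not followed."""
    if ia == 0:
        vc = 0.0  # a wrong edit should not be rewarded for looking pretty
    return (ia * cp * vc) ** (1.0 / 3.0)


# Example: perfect adherence and preservation, slightly flawed blending.
print(final_score(1.0, 1.0, 0.67))  # ≈ 0.875
```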

— Morphological Level Tasks —

— New Concept — 🍞 Hook: Copying a dancer’s pose from a stick figure.

🥬 Pose Control: Change the character’s pose to exactly match a reference while keeping identity, body wholeness, and the scene.

  • How it works: Check limb matches, single coherent body, same character, and background preserved.
  • Why it matters: Without it, the person becomes someone else or gets extra limbs.

🍞 Anchor: The same superhero now stands like the reference image but still looks like the same hero.

— New Concept — 🍞 Hook: Rotating a toy car to face the same direction as an arrow.

🥬 Reorientation: Align an object’s facing direction (yaw/pitch/roll) to a cue without changing what it is or breaking visuals.

  • How it works: Judge axis alignment, identity consistency, and visual integrity.
  • Why it matters: Wrong orientation breaks realism; identity swaps are unacceptable.

🍞 Anchor: A camera reoriented to point where the green frustum indicates, still the same camera, no artifacts.

— New Concept — 🍞 Hook: Turning a rough doodle into a finished scene.

🥬 Draft Instantiation: Convert sparse on-image sketches into full details that fit the scene and style.

  • How it works: Use IA, CP, VC again because it’s about following marks, not free styling.
  • Why it matters: Without respecting the draft, the output ignores the designer’s blueprint.

🍞 Anchor: Red lines sketch a new jacket; the model fabricates the jacket that fits the person and matches the art style.

— Causal Level Tasks —

— New Concept — 🍞 Hook: Move a flashlight and watch shadows slide.

🥬 Light Control: Change lighting direction per the arrow, update shadows/highlights, keep everything else the same.

  • How it works: Score lighting-direction consistency and context preservation.
  • Why it matters: If shadows don’t move correctly, it breaks physics.

🍞 Anchor: Arrow from the right yields right-side highlights and left-side shadows.
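
The physics the judge is looking for here is essentially Lambert's cosine law: a surface gets brighter as its normal lines up with the direction toward the light. The toy function below only illustrates that relationship; it is not part of VIBE's evaluation.

```python
import numpy as np


def lambert_brightness(normal, light_dir) -> float:
    """Toy Lambertian shading: brightness follows the cosine of the angle between
    the surface normal and the direction toward the light (clamped at zero)."""
    n = np.asarray(normal, dtype=float)
    l = np.asarray(light_dir, dtype=float)
    n, l = n / np.linalg.norm(n), l / np.linalg.norm(l)
    return max(0.0, float(np.dot(n, l)))


# A right-facing surface is lit when the light comes from the right...
print(lambert_brightness(normal=[1, 0], light_dir=[1, 0]))   # 1.0
# ...and falls into shadow when the light comes from the left.
print(lambert_brightness(normal=[1, 0], light_dir=[-1, 0]))  # 0.0
```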

— New Concept — 🍞 Hook: Turn on a fan and see hair and clothes sway.

🥬 Flow Simulation: Apply wind from the arrow direction; elements respond consistently while identities and placement remain.

  • How it works: Score wind direction consistency and a two-part context preservation (identity and pose/placement).
  • Why it matters: Without consistent flow, the scene looks random.

🍞 Anchor: A flag flutters in the arrow’s direction, but the flagpole stays put.

— New Concept — 🍞 Hook: Predict where a cue ball goes after bouncing off the table.

🥬 Billiards: Draw the trajectory and mark the first ball hit, preserving the setup.

  • How it works: Score path correctness (bounce order), collision correctness (right ball), and context preservation.
  • Why it matters: Tests multi-step physical reasoning with clear right/wrong signals.

🍞 Anchor: The cue ball bounces top-then-left walls and hits ball #3; the output shows that exact sequence.
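
The multi-step reasoning this task probes can be pictured as ideal reflections off axis-aligned cushions. The toy sketch below (not VIBE's evaluation code) traces the cue ball's velocity through the top-then-left bounce from the anchor example.

```python
def reflect_off_cushion(velocity: tuple, wall: str) -> tuple:
    """Ideal reflection (no friction or spin): a top/bottom cushion flips the
    y-component of the velocity, a left/right cushion flips the x-component."""
    vx, vy = velocity
    if wall in ("top", "bottom"):
        return (vx, -vy)
    return (-vx, vy)


# A shot heading up and to the left meets the top cushion first, then the left one.
v = (-1.0, 2.0)
v = reflect_off_cushion(v, "top")    # -> (-1.0, -2.0)
v = reflect_off_cushion(v, "left")   # -> (1.0, -2.0), now heading toward ball #3
print(v)
```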

04 Experiments & Results

🍞 Hook: Imagine a report card that shows not just your final grade, but how you did in reading, math, and science separately. That’s what VIBE’s experiments do for editing AIs.

🥬 The Test: VIBE includes 1,034 carefully verified samples across 10 tasks and three levels. Each example comes with an image, text prompt, and visual instruction. The judge (GPT-5.1) scores each result three times and averages the scores to reduce randomness. The main checks are: did you follow the instruction (IA), avoid messing up the background (CP), and keep things seamless (VC)? Morphological and Causal tasks use matching, physics-aware variants of these checks.

🥬 The Competition: 17 models were tested—7 proprietary (like Nano Banana Pro, Nano Banana, GPT-Image-1, Seedream 4.0/4.5, Wan 2.5/2.6) and 10 open-source (like FLUX2-dev, Qwen-Image-Edit family, BAGEL, Step1X-Edit, UniWorld-V1, OmniGen/OmniGen2). This broad lineup shows how different training scales and data shape visual instruction-following.

🥬 The Scoreboard (with context):

  • Deictic Level (point and do): Proprietary models generally score high on Addition, Removal, and Replacement (often 60–95). For example, Nano Banana Pro averages around the mid-70s to 80s across Deictic tasks, signaling strong region-following when marks are explicit. Translation and Reorientation are trickier due to directional reasoning.
  • Morphological Level (reshape to blueprint): Scores drop compared to Deictic. Pose Control and Draft Instantiation show that structured reshaping is doable but fragile; identity preservation and style matching can fail. Still, top proprietary models remain ahead of open-source peers.
  • Causal Level (physics-like effects): The hardest. Even the best models often land below 50%, especially on tasks like Light Control consistency and Billiards trajectory reasoning. This shows present systems lack robust internal “world models.”

Concrete comparisons:

  • Proprietary vs Open-source: Proprietary models consistently lead across all levels. Open-source models lag notably on instruction adherence and visual coherence, indicating room for training and alignment improvements.
  • Style sensitivity: On Deictic tasks, style preferences vary by model (some do better on real photos, others on animation/sketch). On Draft Instantiation, many models do best on animation, likely because clean lines align with sketch guidance.
  • Multi-instruction tests: When combining two or three Deictic tasks in one prompt, scores drop relative to single-task cases (e.g., Nano Banana Pro’s averages slip from mid-80s to mid-70s), revealing challenges in compositional execution.
  • Judge reliability: LMM-as-a-Judge scores correlate strongly with human experts (Pearson r ≈ 0.96), supporting trust in automated evaluation.
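
Agreement of this kind is an ordinary Pearson correlation between paired scores. The snippet below shows only the computation; the score values are invented for illustration, and the r ≈ 0.96 figure comes from the paper, not from this data.

```python
import numpy as np

# Paired scores for the same set of edits: automated judge vs. human experts.
# These numbers are made up purely to demonstrate the computation.
judge_scores = np.array([0.92, 0.40, 0.75, 0.10, 0.66])
human_scores = np.array([0.90, 0.35, 0.80, 0.15, 0.60])

r = np.corrcoef(judge_scores, human_scores)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # values close to 1.0 mean the judge tracks human judgment
```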

🥬 Surprising Findings:

  • Visual instructions can outperform super-detailed text in target localization; a simple box can beat a long paragraph. But for complex semantics (e.g., very specific clothing details), you need both: the visual pointer plus richer text.
  • Some models excel on sketches for Deictic tasks but struggle to keep sketch style in Draft Instantiation, showing style transfer is different from style preservation.
  • Adding tasks together (add+remove+replace in one go) creates failure modes that don’t show up when tasks are isolated—compositionality remains a key weakness.

🍞 Anchor: It’s like students acing worksheets when questions are one at a time but stumbling on mixed quizzes that require juggling steps together. The top kids still do best, but even they slip when physics word problems show up.

05 Discussion & Limitations

🍞 Hook: Think of VIBE like a fitness course with stations: sprinting, balance, and puzzle-solving. Most runners ace sprinting, fewer handle balance beams, and puzzles stump many.

🥬 Limitations:

  • Coverage: VIBE spans 10 tasks, but it can’t cover every real-world edit (e.g., fine-grained material swaps, complex reflections, or multi-actor interactions in dynamic scenes).
  • Evaluation Boundaries: Even with strong LMM judging and binary-heavy criteria, some nuanced aesthetics or borderline physics cases are hard to score perfectly.
  • Data Scale: 1,034 curated cases is sizable for careful judging, but more scenarios (e.g., complex occlusions, multiple light sources) would further stress-test reasoning.
  • Model Access: Proprietary models lead; not all are publicly inspectable, limiting ablations on architecture/training.

🥬 Required Resources:

  • You need access to the VIBE dataset, a strong LMM judge (e.g., GPT-5.1), and the image-editing model(s) under test.
  • For training/improvement work, diverse multimodal data with aligned visual marks is critical, as well as compute for diffusion or generative backbones.

🥬 When NOT to Use:

  • If you want to judge pure text-only prompting, VIBE’s focus on visual marks won’t align with your goals.
  • If your tasks demand strict pixel-accurate ground truth (e.g., medical segmentation), VIBE’s open-ended edits and LMM judging may feel too flexible.
  • If your edits involve long temporal video reasoning, VIBE’s single-image tasks might be insufficient.

🥬 Open Questions:

  • How to robustly learn from visual marks so models generalize across styles and domains without overfitting to annotation appearance?
  • What training signals best teach causal consistency (lighting, fluid, rigid-body physics) without heavy simulators?
  • How to scale compositional instruction-following so multiple edits remain disentangled and correctly sequenced?
  • Can visual-thinking traces (sketch-then-edit) help models plan edits more reliably?

🍞 Anchor: VIBE is a strong first map for this territory, but the wilderness is big: more styles, richer physics, and multi-step compositions are next mountains to climb.

06 Conclusion & Future Work

🍞 Hook: You know how a good coach doesn’t just say “play better,” but runs drills for passing, shooting, and defense? VIBE does that for image-editing AIs using visual instructions.

🥬 3-Sentence Summary: VIBE is a benchmark that tests how well models follow visual instructions for image editing across three levels—pointing (Deictic), reshaping (Morphological), and cause-and-effect (Causal). It uses a reliable LMM-as-a-Judge with task-specific, mostly binary checks to score instruction-following, background safety, and visual cleanliness. Experiments on 17 models show proprietary systems lead, but everyone’s performance drops as tasks demand more reasoning, especially in causal physics.

🥬 Main Achievement: Turning messy, multimodal editing into a clean, stepwise evaluation ladder with scalable, human-aligned judging—and revealing exactly where today’s models struggle.

🥬 Future Directions: Improve causal world modeling (lighting, wind, rigid-body motion), strengthen style-preserving reshaping, and teach composition so multiple visual instructions can be executed jointly. Explore training with visual marks at scale and integrate visual planning steps (sketch-while-reasoning).

🥬 Why Remember This: VIBE reframes image editing the way humans naturally communicate—by showing and telling—and gives the community a clear scoreboard to chase, moving us closer to image editors that understand a circled spot, a sketched pose, or an arrowed wind just like we do.

Practical Applications

  • Photo retouching with precision: Draw a box over blemishes to remove only that area without harming nearby details.
  • Product mockups: Sketch a new handle on a mug and have the model instantiate a coherent, brand-consistent version.
  • Interior design tweaks: Box a wall and say “add window here,” or reorient a lamp to match the desired lighting direction.
  • Fashion try-ons: Outline a jacket and replace it with a different style while preserving the person’s identity.
  • Storyboarding and comics: Provide stick-figure poses for characters; the model converts them into finished panels in the same style.
  • Marketing edits at scale: Use arrows and boxes to guide batch edits (add logos, replace backgrounds) with minimal text.
  • Scientific illustration updates: Sketch annotations to add or remove labeled elements cleanly without changing the rest of the figure.
  • Education demos: Show causal effects—like moving a light source—and let students see shadows and highlights shift.
  • Game asset iteration: Draft new props via sketches over screenshots and instantiate them in consistent art style.
Tags: visual instruction following · image editing benchmark · deictic grounding · morphological manipulation · causal reasoning · LMM-as-a-judge · instruction adherence · contextual preservation · visual coherence · pose control · reorientation · draft instantiation · light control · flow simulation · billiards trajectory