VIBE: Visual Instruction Based Editor
Key Summary
- VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.
- It uses a small 2-billion-parameter vision-language model (VLM) to understand your instruction and a 1.6-billion-parameter diffusion model to actually draw the edit.
- A fast trick called channel-wise concatenation lets VIBE use the original image without slowing down the transformer's attention, keeping edits quick.
- Special meta tokens act like bookmarks inside the VLM so it can focus on your exact edit request and pass a clean signal to the image generator.
- Training happens in four stages: connector alignment, big pretraining, supervised fine-tuning, and DPO (learning from preferences) to polish quality and instruction-following.
- VIBE is optimized for strict source consistency: if you don't ask for a change, it tries very hard not to change it.
- On major benchmarks (ImgEdit and GEdit), VIBE matches or beats much larger models on many edit types, especially attribute tweaks, removals, and background edits.
- It runs on a single 24 GB GPU and can make 2K images in about 4 seconds on an NVIDIA H100 (BF16), making it practical and affordable.
- Clever data design (triplet inversion, bootstrapping, and real-world instructions) helps VIBE learn realistic, user-like edits.
- The biggest remaining challenges are complex action-style edits with large geometric changes and polishing tiny visual details.
Why This Research Matters
Fast, faithful image editing unlocks creativity for everyone, not just experts with pro software. With VIBE, you can describe the change you want and get it in seconds, while the rest of your photo stays untouched. This is valuable for students, small businesses, and creators who need quick, reliable results without expensive hardware. E-commerce can update product images (like colors or backgrounds) safely and consistently. Media teams can make precise corrections while preserving identity and layout. By proving small models can perform at a high level, VIBE lowers costs and speeds up iteration, bringing high-quality visual editing to more users and devices.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine asking a friend, "Can you make the sky more pink and remove that sign?" and they do it perfectly while keeping everything else the same. That's what we want from image editing with words.
🥬 Filling (The Actual Concept)
- What it is: Instruction-based image editing is when an AI changes a picture exactly according to your text instruction.
- How it works:
- You give the AI a photo and a sentence like "Make the shirt red and remove the logo."
- The AI reads your words and looks at the image to figure out where and what to change.
- It redraws just the requested parts while keeping everything else stable.
- Why it matters: Without this, people need complex tools and lots of skill to do precise edits; with it, anyone can describe the change in plain language.
🍞 Bottom Bread (Anchor) You say, "Brighten the living room and remove the coffee cup on the table." The AI increases brightness and erases only the cup, leaving the sofa, rug, and window exactly the same.
The World Before
Before systems like VIBE, two common paths existed:
- Traditional tools: Powerful but hard for non-experts; users must make precise selections, layers, masks, and adjustments by hand.
- Early AI editors: Either training-free tricks (like attention steering) that were fast but often clumsy, or big trained diffusion models that were powerful but very heavy and slow (often 6-20 billion parameters). These approaches struggled with two big pains: (1) efficiency, since running on a single affordable GPU was hard, and (2) source consistency, meaning keeping everything unchanged except what you asked for.
The Problem
Make a system that:
- Understands your instruction in the context of the specific photo (not just general text),
- Preserves all unmentioned details (identity, layout, lighting), and
- Runs quickly and cheaply on common hardware.
🍞 Top Bread (Hook) You know how a good helper listens to your words and also looks at what you're pointing to? If they don't look, they might change the wrong thing.
🥬 Filling (The Actual Concept)
- What it is: A Vision-Language Model (VLM) is an AI that understands both pictures and words together.
- How it works:
- It reads the instruction and sees the image at the same time.
- It decides what parts of the picture your words refer to.
- It creates a guidance signal telling the generator exactly what to change.
- Why it matters: Without a VLM, the editor might misinterpret your request because it can't see the actual photo while reading the text.
🍞 Bottom Bread (Anchor) You ask, "Remove the red sticker on the left laptop." The VLM checks the image, finds the left laptop (not the right), spots the red sticker, and tells the painter model to erase just that.
Failed Attempts
- Text-only conditioning: Many diffusion backbones read only text, not the source photo, so they can't ground instructions in what's really there.
- Heavy multimodal backbones: Putting everything in one giant model works, but it's expensive and slow for everyday use.
- Noisy training data: Big but messy datasets made models pick up bad habits (artifacts, unintended changes) that later training couldn't fully fix.
The Gap
We needed a middle path: a compact, fast system that still reasons about both the photo and the instruction, and that is trained with a disciplined recipe to avoid forgetting quality while keeping edits minimal and faithful.
🍞 Top Bread (Hook) Think of a careful librarian who makes only the exact edits you ask for on a book page and refuses to smudge any other lines.
🥬 Filling (The Actual Concept)
- What it is: Strict source consistency means the AI changes only what you requested and preserves everything else.
- How it works:
- Train with data that punishes unintended changes.
- Use models that read both text and the actual image.
- Filter out examples with sneaky shifts (like face movements) and low-quality artifacts.
- Why it matters: Without this, even a good-looking edit can be wrong if it moves faces, alters identity, or redraws untouched areas.
🍞 Bottom Bread (Anchor) Ask, "Replace the sky with a sunset but keep the building the same." A strictly consistent editor changes the sky only; the building's bricks, edges, and windows don't move or blur.
Real Stakes
- Everyday creators: Fast, language-driven edits for social posts and design mockups.
- E-commerce: Swap product colors or remove distractions without disturbing the rest of the photo.
- Photography: Quick retouching that respects the original scene and identity.
- Education and media: Reliable visuals where changes are traceable and limited.
- Accessibility: People who can describe changes can edit without advanced software skills.
VIBE's story is about doing more with less: a compact VLM plus a compact diffusion model, trained with a carefully staged process and cleaned, reality-based data, to reach high-quality, faithful edits, fast.
02 Core Idea
The "Aha!" Moment in One Sentence
Use a small vision-language model to look at the instruction and the image together, pack its understanding into special meta tokens, convert that signal with a lightweight connector, and guide a compact diffusion model that edits only what you asked, quickly and consistently.
Multiple Analogies
- Movie director and camera crew: The VLM is the director who reads the script (instruction) while watching the scene (image), the connector is the walkie-talkie translating the director's notes, and the diffusion model is the camera crew that shoots the exact reshoot.
- GPS for painting: The VLM plans the route (what to change and where), the connector turns it into precise turn-by-turn directions, and the diffusion model drives the brush to the target without wandering.
- Recipe handoff: The VLM tastes the dish (image) and reads your tweak request, the connector rewrites that into the kitchen's prep notes, and the diffusion model is the chef making that exact adjustment, with no surprise ingredients.
🍞 Top Bread (Hook) You know how sticky notes help you remember key points while reading a long chapter?
🥬 Filling (The Actual Concept)
- What it is: Meta tokens are special learnable tokens inserted into the VLM so it can produce a compact, edit-aware representation for the image generator.
- How it works:
- Insert N special meta tokens alongside your instruction.
- The VLM processes the photo + your text + these meta tokens together.
- The meta tokens come out carrying the "meaning of the edit in this specific image."
- Why it matters: Without meta tokens, the VLM's last-layer states can be a poor fit for the diffusion model, leading to weak or ambiguous guidance.
🍞 Bottom Bread (Anchor) Instruction: "Make the shirt red, keep everything else the same." The meta tokens carry that exact plan: "shirt region → red; do not touch other areas."
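To make the idea concrete, here is a minimal PyTorch sketch of learnable meta tokens appended to the VLM's input embeddings. The token count (224) follows the paper's setup; the class name, hidden size, and initialization are illustrative assumptions, not VIBE's actual implementation.

```python
import torch
import torch.nn as nn


class MetaTokenPrompt(nn.Module):
    """Learnable meta tokens appended to the VLM input embeddings (illustrative sketch)."""

    def __init__(self, num_meta_tokens: int = 224, hidden_dim: int = 2048):
        super().__init__()
        # One learnable embedding per meta token; hidden_dim is an assumed VLM width.
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, hidden_dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, seq_len, hidden_dim) embeddings of the instruction tokens.
        batch = text_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(batch, -1, -1)
        # The VLM attends over [image tokens | instruction | meta tokens] together;
        # the contextualized meta-token states from its last layer become the edit summary.
        return torch.cat([text_embeds, meta], dim=1)
```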
🍞 Top Bread (Hook) Imagine you could add a new adapter cable so your headphones fit a different device perfectly.
🥬 Filling (The Actual Concept)
- What it is: The connector is a small transformer that maps the VLMās meta-token states into the diffusion modelās conditioning space.
- How it works:
- Take the meta-token embeddings from the VLM.
- Pass them through a few lightweight transformer blocks (4 layers works best here).
- Output features the diffusion model can use during denoising.
- Why it matters: Without a connector, the VLM's message won't match the diffusion model's "language," causing instability or quality loss.
🍞 Bottom Bread (Anchor) The connector turns "change the left mug to blue" into a format the painter (diffusion model) instantly understands, so it edits the left mug, not the table.
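A hedged sketch of what such a 4-layer connector could look like: it projects the meta-token states into the diffusion model's conditioning width and refines them with standard transformer blocks. Dimensions, head count, and normalization choices are assumptions for illustration.

```python
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Maps VLM meta-token states to diffusion conditioning features (sketch)."""

    def __init__(self, vlm_dim: int = 2048, cond_dim: int = 2240, num_layers: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(vlm_dim, cond_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=cond_dim, nhead=8, dim_feedforward=4 * cond_dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, meta_states: torch.Tensor) -> torch.Tensor:
        # meta_states: (batch, 224, vlm_dim) contextualized meta-token states from the VLM.
        # Returns (batch, 224, cond_dim) features used as conditioning during denoising.
        return self.norm(self.blocks(self.proj_in(meta_states)))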
🍞 Top Bread (Hook) Think of sliding a clear sheet on top of a drawing: adding just one extra layer instead of doubling all the pages.
🥬 Filling (The Actual Concept)
- What it is: Channel-wise concatenation is a fast way to feed the source image latent into the diffusion model without increasing the attention sequence length.
- How it works:
- Encode the source image into a latent grid.
- Concatenate this latent with the noisy latent along channels (not tokens).
- Use a widened input conv to bring it back to the right shape, then continue as usual.
- Why it matters: Without this, sequence-wise concatenation makes the token list longer and slows attention a lot.
🍞 Bottom Bread (Anchor) VIBE keeps the number of tokens the same, so attention stays fast while still seeing the original image.
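The sketch below shows the shape arithmetic, assuming a 32-channel latent on a 64×64 grid (both illustrative values); the key point is that channel-wise concatenation leaves the number of spatial tokens untouched.

```python
import torch
import torch.nn as nn

# Illustrative shapes; real values depend on the VAE and diffusion backbone.
latent_channels = 32
noisy_latent = torch.randn(1, latent_channels, 64, 64)   # latent being denoised
source_latent = torch.randn(1, latent_channels, 64, 64)  # VAE encoding of the source image

# Channel-wise guidance: stack along channels, then a widened input conv restores the width.
x = torch.cat([noisy_latent, source_latent], dim=1)       # (1, 64, 64, 64): still 64*64 tokens
widened_conv = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=3, padding=1)
x = widened_conv(x)                                       # (1, 32, 64, 64), same token count

# Sequence-wise guidance (the slower alternative) would instead concatenate along the token
# axis, doubling the sequence length that attention has to process at every step.
```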
Before vs After
- Before: Either huge unified models (costly) or text-only conditioning (confused edits), and training that often forgot aesthetic quality or drifted away from faithful edits.
- After: A compact VLM + compact diffusion, bridged by meta tokens and a small connector, trained in four disciplined stages to preserve both quality and faithfulness at speed.
Why It Works (Intuition)
- The VLM sees the image and the text together, so ambiguity drops.
- Meta tokens give the VLM a dedicated āslotā to encode the edit intent.
- The connector translates that intent into the painterās native dialect.
- Channel-wise guidance keeps inference fast, enabling real-time iteration.
- Multi-task training with text-to-image prevents the editor from forgetting how to generate new content when needed.
- DPO uses preference pairs to steadily nudge outputs toward faithful and pretty results without a separate reward model.
Building Blocks
- A compact VLM (Qwen3-VL-2B) to read image+instruction.
- Learnable meta tokens inside the VLM to hold edit intent.
- A 4-layer connector mapping meta tokens to diffusion conditioning.
- A compact diffusion transformer (Sana1.5-1.6B) for fast, high-res synthesis.
- Channel-wise concatenation to inject the source image latency-free.
- Four training stages: alignment → pretrain → SFT → DPO.
- Strict data filtering and real-world-style instructions to enforce source consistency and usability.
🍞 Top Bread (Hook) You know how voting helps decide which design everyone likes better?
🥬 Filling (The Actual Concept)
- What it is: Direct Preference Optimization (DPO) teaches the model from pairs of "better vs. worse" edits instead of hard-to-design scores.
- How it works:
- For the same instruction, compare two edits.
- Mark which one is preferred (clearer, more faithful, fewer artifacts).
- Train the model to move toward the preferred and away from the rejected.
- Why it matters: Without preference learning, the model might keep tiny, annoying artifacts or misunderstand tricky requests.
🍞 Bottom Bread (Anchor) Between two "make the sky purple" results, one edit is clean and accurate while the other shifts the building. DPO learns to choose the clean, accurate one next time.
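For intuition, here is a simplified Diffusion-DPO-style preference loss. It is a sketch of the general technique, not VIBE's exact objective; the beta value and the way denoising errors are computed are assumptions.

```python
import torch
import torch.nn.functional as F


def diffusion_dpo_loss(err_policy_win: torch.Tensor, err_ref_win: torch.Tensor,
                       err_policy_lose: torch.Tensor, err_ref_lose: torch.Tensor,
                       beta: float = 2000.0) -> torch.Tensor:
    """Simplified Diffusion-DPO-style loss (sketch, not the paper's exact formulation).

    Each argument holds per-sample denoising errors (e.g. MSE between predicted and true
    noise) for the preferred ("win") or rejected ("lose") edit, under the trainable policy
    or the frozen reference model.
    """
    win_gap = err_policy_win - err_ref_win      # policy improvement on preferred edits
    lose_gap = err_policy_lose - err_ref_lose   # policy improvement on rejected edits
    # Reward improving on the preferred edit more than on the rejected one.
    return -F.logsigmoid(-beta * (win_gap - lose_gap)).mean()
```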
03 Methodology
At a High Level: Input → VLM with Meta Tokens → Connector → Diffusion with Channel-wise Guidance → Edited Image
Step 0: Ingredients
- Inputs: A source image and a natural-language instruction.
- Tools: Qwen3-VL-2B-Instruct (VLM), 224 meta tokens, a 4-layer connector, and Sana1.5-1.6B diffusion model.
- Goal: High-quality, fast, strictly faithful edits up to 2K resolution.
🍞 Top Bread (Hook) Imagine teaching two teammates to cooperate by first agreeing on a handshake so they don't step on each other's toes.
🥬 Filling (The Actual Concept)
- What it is: Connector alignment is a warm-up stage that teaches the connector to speak the diffusion modelās language before full editing training.
- How it works:
- Freeze the VLM and diffusion model.
- Train only the connector and meta tokens on a text-to-image task with high-aesthetic data.
- Stop when the connector produces stable, useful conditioning.
- Why it matters: Without this, early training scrambles the diffusion model and harms its generative quality.
🍞 Bottom Bread (Anchor) Before editing, we practice by generating pretty images from text so the connector learns a clean handshake.
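A minimal sketch of the Stage A setup, written as a function so it is self-contained; the module and parameter names are illustrative placeholders (matching the earlier sketches), and the optimizer choice and learning rate are assumptions.

```python
import torch
import torch.nn as nn


def configure_stage_a(vlm: nn.Module, diffusion_model: nn.Module,
                      connector: nn.Module, meta_tokens: nn.Parameter):
    """Stage A (connector alignment): freeze the big parts, train only the bridge (sketch)."""
    for p in vlm.parameters():
        p.requires_grad_(False)
    for p in diffusion_model.parameters():
        p.requires_grad_(False)
    trainable = list(connector.parameters()) + [meta_tokens]
    # Optimizer and learning rate are illustrative assumptions, not values from the paper.
    return torch.optim.AdamW(trainable, lr=1e-4)
```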
Step 1: Reference Image Guidance (Channel-wise)
- What happens: Encode the source image into a latent grid LR using a VAE. Concatenate LR to the noisy latent along channels.
- Why it exists: This lets the diffusion model "see" the original image without lengthening the token sequence, keeping attention fast.
- Example: Editing "change the mug to blue" keeps the same number of tokens, so inference is quick.
🍞 Top Bread (Hook) Think of putting sticky tabs in a book before you hand it to a friend so they can quickly find the right spots.
🥬 Filling (The Actual Concept)
- What it is: VLM with meta tokens for text+image understanding.
- How it works:
- Add 224 trainable meta tokens to the VLMās input.
- Feed in the instruction and the image together, with a helpful prefix like "What would the image look like if {instruction}?"
- The VLM outputs contextualized meta-token states that summarize the edit in context.
- Why it matters: Without meta tokens, the model's guidance is blurrier and less aligned to the image content.
🍞 Bottom Bread (Anchor) Instruction: "Remove the left sticker." The meta tokens focus the VLM on the left side and on the sticker, not other logos.
Step 2: Connector (4-layer Transformer)
- What happens: Pass the VLMās meta-token states through the connector, producing conditioning features for the diffusion model.
- Why it exists: The VLM and diffusion model speak different "dialects." The connector is the translator.
- Example: It turns "make the shirt red" into signals the diffusion model uses at each denoising step.
Step 3: Multi-Stage Training
- Stage A: Alignment (done at 512px). Train only the connector + meta tokens on text-to-image until stable.
- Stage B: Pre-training (≤1024px). Jointly train editing and T2I tasks with mixed aspect ratios; update diffusion + connector + meta tokens; keep VLM frozen.
- Stage C: Supervised Fine-Tuning (≤2048px). Use strictly filtered, high-quality triplets (plus T2I) to polish faithfulness and remove artifacts.
- Stage D: Preference Alignment (DPO). Use preference pairs to improve instruction adherence and aesthetic quality.
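The staged recipe can be summarized as a small config; the resolutions and trainable modules mirror the bullets above, and anything not stated there (for example, which modules the DPO stage updates) is marked as an assumption.

```python
# Illustrative summary of the four-stage recipe; values not stated above are assumptions.
TRAINING_STAGES = [
    {"stage": "A", "name": "connector alignment",        "task": "text-to-image",
     "max_res": 512,  "trainable": ["connector", "meta_tokens"]},
    {"stage": "B", "name": "pre-training",               "task": "editing + T2I",
     "max_res": 1024, "trainable": ["diffusion", "connector", "meta_tokens"]},  # VLM frozen
    {"stage": "C", "name": "supervised fine-tuning",     "task": "filtered editing + T2I",
     "max_res": 2048, "trainable": ["diffusion", "connector", "meta_tokens"]},
    {"stage": "D", "name": "preference alignment (DPO)", "task": "preference pairs",
     "max_res": 2048, "trainable": ["diffusion", "connector", "meta_tokens"]},  # assumption
]
```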
🍞 Top Bread (Hook) You know how practicing both shooting and passing helps a soccer player be balanced?
🥬 Filling (The Actual Concept)
- What it is: Mixed T2I+editing training keeps the model creative and faithful at the same time.
- How it works:
- In each batch, include both editing triplets and pure text-to-image samples.
- For T2I, feed a blank image that's ignored by attention, and use a template like "generate the image by description: {prompt}".
- For editing, use "what will this image be like if {prompt}".
- Why it matters: Without T2I mixed in, the model overfits to narrow editing data and forgets how to synthesize new content cleanly.
🍞 Bottom Bread (Anchor) When you ask to add a cat, the model still knows how to draw a good cat because it kept practicing T2I during training.
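A sketch of how one mixed training sample might be formatted, using the two templates quoted above; `blank_image` and the `item` dictionary keys are hypothetical helpers, not names from the paper.

```python
EDIT_TEMPLATE = "what will this image be like if {prompt}"
T2I_TEMPLATE = "generate the image by description: {prompt}"


def build_sample(item: dict) -> dict:
    """Format one training sample for the VLM (sketch); `item` keys are illustrative."""
    if item["task"] == "t2i":
        # Text-to-image: a blank placeholder image is fed in and ignored by attention.
        return {"image": blank_image(),  # hypothetical helper returning a blank canvas
                "text": T2I_TEMPLATE.format(prompt=item["prompt"]),
                "target": item["target_image"]}
    # Editing: condition on the real source image and supervise with the edited result.
    return {"image": item["source_image"],
            "text": EDIT_TEMPLATE.format(prompt=item["prompt"]),
            "target": item["edited_image"]}
```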
Resolution Strategy and Batching
- Train at mixed resolutions (384-2048px) with diverse aspect ratios in both pre-training and SFT.
- Adaptive batch sizing: Larger batches for lower resolutions to fully use the GPU.
- Why: This preserves high-res priors, avoids upscaling harm, and improves convergence.
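One simple way to implement adaptive batch sizing is to scale the batch inversely with pixel count so memory use stays roughly flat; the base values below are illustrative, not the paper's.

```python
def adaptive_batch_size(height: int, width: int,
                        base_batch: int = 64, base_res: int = 512) -> int:
    """Scale batch size inversely with pixel count so GPU memory use stays roughly flat."""
    scale = (base_res * base_res) / (height * width)
    return max(1, int(base_batch * scale))


# e.g. adaptive_batch_size(512, 512) -> 64, adaptive_batch_size(2048, 2048) -> 4
```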
Data Construction and Filtering (Highlights)
- Big, clean triplets: ≈7.7M for pretraining; ≈6.8M for SFT; plus 48M aesthetic T2I images.
- Realism boosts: Tripod-captured before/after photos; static-camera video frames for realistic additions; virtual try-on composites; object-level and full-image stylization with inversions.
- Automatic mining with validators: Over-generate candidates and keep only those that pass quality checks.
- Face IoU filter: Remove pairs where faces shift too much (IoU < 0.9) to prevent artifacts.
- Homography alignment: Fix tiny geometric misalignments so the model learns the edit, not mis-registrations.
- Just-in-Time augmentations: Reversible photometric changes (blur/deblur, noise/denoise, grayscale/colorize), identity mapping ("do nothing"), and careful mirroring.
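The face IoU gate mentioned in the list above can be sketched as follows; face detection and box matching are assumed to happen upstream, and only the threshold check is shown.

```python
def box_iou(box_a, box_b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def keep_pair(source_faces, edited_faces, threshold: float = 0.9) -> bool:
    """Drop a source/edited training pair if any matched face shifted too much."""
    return all(box_iou(a, b) >= threshold for a, b in zip(source_faces, edited_faces))
```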
🍞 Top Bread (Hook) Think of taking a path forward and then learning to walk it backward: it doubles your practice without extra trips.
🥬 Filling (The Actual Concept)
- What it is: Triplet inversion and compositional bootstrapping reuse edits to create reverse and cross-edits.
- How it works:
- Inversion: From edited image back to original with the reverse instruction.
- Composition: From edit A to edit B via "undo A then do B," packed as one combined instruction.
- This multiplies training signals without new images.
- Why it matters: Without this, data collection is slower and narrower; with it, the model learns transitions and reversibility.
🍞 Bottom Bread (Anchor) From "add a hat" (A) and "change shirt to blue" (B) on the same person, you can train A→B and B→A, plus invert each to recover the original.
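A sketch of triplet inversion and compositional bootstrapping; `reverse_instruction` is a hypothetical helper (for example, an LLM that rewrites "add a hat" into "remove the hat"), and the combined-instruction wording is illustrative.

```python
def invert_triplet(triplet: dict) -> dict:
    """Turn (source, instruction, edited) into the reverse edit (sketch)."""
    return {
        "source": triplet["edited"],
        "instruction": reverse_instruction(triplet["instruction"]),  # hypothetical helper
        "edited": triplet["source"],
    }


def compose_triplets(triplet_a: dict, triplet_b: dict) -> dict:
    """Bootstrap a cross-edit A -> B from two edits that share the same source image."""
    combined = (f"undo the previous edit ({triplet_a['instruction']}), "
                f"then {triplet_b['instruction']}")
    return {"source": triplet_a["edited"], "instruction": combined,
            "edited": triplet_b["edited"]}
```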
🍞 Top Bread (Hook) When judging a contest, you keep only entries that are clearly better in every required way.
🥬 Filling (The Actual Concept)
- What it is: Strict-dominance preference pairing for DPO picks winners that beat losers on both instruction-following and aesthetics.
- How it works:
- Generate candidate edits from multiple prompts or models.
- Keep pairs only if one image is strictly better on both axes.
- Train the model using these pairs so it learns balanced improvements.
- Why it matters: Without strict dominance, scalarized scores can over-optimize one goal (e.g., pretty) and hurt the other (faithful).
🍞 Bottom Bread (Anchor) Between two outputs, we prefer the one that both follows "make the cup blue" exactly and also looks cleaner; only those pairs train DPO.
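A sketch of strict-dominance pair selection; each candidate is assumed to carry instruction-following and aesthetic scores from upstream validators, whose implementation is not shown here.

```python
def strict_dominance_pairs(candidates: list) -> list:
    """Keep only (winner, loser) pairs where the winner is strictly better on BOTH axes."""
    pairs = []
    for win in candidates:
        for lose in candidates:
            if (win["instruction_score"] > lose["instruction_score"]
                    and win["aesthetic_score"] > lose["aesthetic_score"]):
                pairs.append((win, lose))
    return pairs
```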
Secret Sauce (What's Clever)
- Meta tokens + small connector give strong, image-aware guidance without heavy architectures.
- Channel-wise source injection keeps attention cost flat, so edits are fast.
- Mixed T2I/edit training prevents forgetting and keeps additions high-quality.
- Strict data cleaning (face IoU, homography) enforces source consistency.
- DPO with strict-dominance pairs aligns both faithfulness and looks.
Result: A compact stack that behaves like a careful, fast editor.
04 Experiments & Results
The Test: What Was Measured and Why
- Benchmarks: ImgEdit (1-5 scoring across edit types) and GEdit (0-10 for Semantic Consistency, Perceptual Quality, and Overall).
- Why these: They check both "Did you follow the instruction?" and "Does the image still look good?", the two main goals of a practical editor.
The Competition
- Classic trained editors: InstructPix2Pix, MagicBrush, AnyEdit, UltraEdit.
- Modern unified or larger editors: FLUX.1 Kontext [Dev], Z-Image, OmniGen/OmniGen2, BAGEL, Step1X-Edit, UniWorld-V1.
- Many of these use bigger backbones (6B-20B), so they're heavier and slower.
Scoreboard with Context
- ImgEdit Overall: VIBE scores 3.85 (second place among listed models). Think of it as an A- when most get Bs and one top student gets an A.
- Category wins (ImgEdit):
  - Adjust: 4.22 (top-tier)
  - Remove: 4.42 (top-tier)
  - Background: 4.22 (top-tier)
  - Strong on Replace, Extract, and Hybrid too.
  - Interpretation: These are the edits where strict source consistency matters most, and VIBE excels there.
- GEdit-Bench-EN:
  - Semantic Consistency: 7.91 (second-highest), like getting 92/100 on "Did you do exactly what was asked?"
  - Perceptual Quality: 6.33 (good, but behind the very prettiest), like 80/100; issues are mostly tiny artifacts, not big misunderstandings.
  - Overall: 6.81 (competitive top-tier).
Surprising (and Useful) Findings
- Sequence-wise vs. channel-wise: Sequence-wise guidance can score slightly higher but slows inference a lot. With Sana's linear attention, inference time roughly doubled; with quadratic attention the slowdown would be even worse. In practice, channel-wise was better for the user experience, and re-sampling a couple of times often matched the sequence-wise results.
- Meta tokens beat alternatives: Using meta tokens with a small connector improved instruction-following over Q-Former and native text-only conditioning.
- Connector depth sweet spot: Four transformer blocks worked best; deeper didn't help and added cost.
- Mixed T2I+edit training helps additions: Keeping T2I alive during training prevented the model from forgetting how to synthesize new objects realistically.
- Strict-dominance DPO pairs: Better balance between faithfulness and looks compared to mixing multiple objectives into one score.
Speed and Footprint
- Hardware: Fits in 24 GB GPU memory.
- Throughput: ~4 seconds per 2K edit on an NVIDIA H100 (BF16), without extra distillation or inference tricks.
- Meaning: Interactive edits and iterations are affordable, so users can try a few variants quickly.
What the Numbers Mean for a User
- If you want careful tweaks (change color, remove an item, clean up the background), VIBE is among the most reliable, even against bigger models.
- If you want wild action poses or huge geometry changes, massive models might still edge out VIBE, though at higher cost and latency.
Takeaway
VIBE proves a compact, efficiency-first stack can hang with the big players and even lead in areas that demand precision and respect for the original image.
05 Discussion & Limitations
Limitations
- Complex action edits: Large, non-local geometry changes (e.g., "make the person jump and turn 45° while moving the camera angle") are still tough for a compact model.
- Fine aesthetic polish: Some tiny artifacts remain, so the very top perceptual quality lags models optimized purely for looks.
- Real photo diversity: Old phone shots, motion blur, and odd lighting can be harder than clean, generated images.
- Frozen VLM: Keeping the VLM frozen preserves knowledge but may limit ultimate instruction grounding for niche cases.
- Bias not fully audited: As with many pipelines built on large pretrained parts and mined data, bias/fairness needs deeper study.
Required Resources
- One 24 GB GPU can run inference with 2K outputs in ~4s.
- Training involves millions of triplets and T2I images plus validators; reproducing full training needs multiple GPUs and careful data pipelines.
When NOT to Use
- Demanding cinematic rewrites (big pose/camera changes) where large, unified models have an edge.
- Ultra-stylized global transforms if you must keep every tiny detail of the original (style transfers can push against strict consistency).
- Forensic or safety-critical editing where any artifact is unacceptable (human review still needed).
Open Questions
- Distillation: Can we shrink inference steps and drop CFG to get sub-2s edits without losing quality?
- Quantization: How far can we compress weights while keeping faithfulness?
- More real-world signal: What's the right mix of tripod photos, video frames, and mined data to improve robustness on messy, real-world pictures?
- VLM fine-tuning: If we unfreeze part of the VLM, can we boost tricky instruction grounding without overfitting?
- Better multi-objective alignment: Beyond strict-dominance DPO, can we learn a stable frontier between faithfulness and beauty that adapts per user?
Honest Bottom Line
VIBE is a careful, compact editor that shines on faithful, local-to-mid edits. For grand, cinematic rewrites, bigger models still help, but at much higher cost.
06 Conclusion & Future Work
3-Sentence Summary
VIBE is a compact, visual instruction-based editor that uses a small VLM plus a small diffusion model, connected by meta tokens and a lightweight transformer, to change only what you ask, fast. A four-stage training recipe (alignment, pretrain, SFT, DPO), strict data filtering, and real-world-style instructions keep edits faithful and high quality. On major benchmarks, VIBE competes with or beats much larger models on many core edits while fitting in 24 GB and delivering 2K results in about 4 seconds.
Main Achievement
Showing that careful design (meta-token guidance, a small connector, channel-wise source injection, mixed T2I+edit training, and strict-dominance DPO) lets a 2B+1.6B stack achieve production-like, source-consistent editing quality under tight compute.
Future Directions
- Distill for fewer diffusion steps and reduced CFG; explore quantization for speed and memory.
- Increase real-world triplets (tripods, static video) and improve validators for even stricter consistency.
- Experiment with partial or full VLM fine-tuning to deepen image-aware instruction grounding.
Why Remember This
VIBE flips the script: you don't need a giant model to get reliable, faithful edits. With the right guidance and data discipline, small can be strong and fast.
Practical Applications
- Quick product photo cleanup: remove background clutter while keeping the product pixel-accurate.
- Color and attribute adjustments: change clothing color or material without altering body shape or pose.
- Targeted object edits: remove a sign, add a small prop, or swap a mug's design.
- Portrait retouching with identity safety: brighten lighting, reduce blemishes, or adjust expression slightly without face drift.
- Real estate photo fixing: replace a dull sky, clean up lawns, or remove small distractions while keeping structures aligned.
- Marketing content iteration: generate multiple faithful variations (e.g., different label colors) quickly for A/B tests.
- Education and journalism: apply minimal, transparent changes (e.g., blur a license plate) without modifying unrelated areas.
- Virtual try-on previews: swap garments on a person while preserving skin tone and background.
- UI/UX mockups: tweak icons, text blocks, or backgrounds precisely without redrawing the entire layout.
- Batch processing on modest hardware: run faithful edits at 2K on a single 24 GB GPU for studio pipelines.