
DreamOmni3: Scribble-based Editing and Generation

Intermediate
Bin Xia, Bohao Peng, Jiyang Liu et al. · 12/27/2025
arXiv · PDF

Key Summary

  • DreamOmni3 lets people edit and create images by combining text, example images, and quick hand-drawn scribbles.
  • It introduces two new tasks—scribble-based editing and scribble-based generation—so you can point where and how to change or create things.
  • Instead of hard-to-manage masks, the model reads both the original image and a scribbled version together, using colors to match instructions to regions.
  • A special trick called “same position and index encoding” helps the model line up pixels in the original and scribbled images so edits stay precise.
  • The team built a large synthetic dataset and a new benchmark to fairly test these scribble tasks.
  • On real-world tests, DreamOmni3 beats strong open-source models and is comparable to commercial systems.
  • Ablation studies show the joint input scheme and the encoding tricks clearly improve results, especially for editing.
  • This approach makes visual creation easier for everyone, because you don’t have to perfectly describe locations in words.
  • The system works with text-only, text+image, or text+image+scribble instructions inside one unified framework.

Why This Research Matters

DreamOmni3 makes image editing and creation more natural by letting you simply draw where and how you want changes. This reduces frustration when words can’t describe exact locations or uncommon items. It also keeps the rest of the picture stable, which is important for real photos, product shots, or design drafts. The approach lowers the barrier for creators, teachers, designers, and kids who think visually. By unifying text, images, and scribbles inside one system, it speeds up workflows and sparks creativity. Strong results against top baselines show it’s not just easier—it’s better. This can power friendlier creative tools across phones, tablets, and desktops.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how, when you want a friend to change a picture—like "make this backpack purple"—you end up pointing with your finger or circling it with a pen so they know exactly where? Words help, but pointing makes it obvious.

🥬 The Concept (Unified generation and editing models): What it is: These are all-in-one AI tools that can both create new pictures and edit existing ones using instructions. How it works: 1) They read your instructions (usually text). 2) They look at your images (if any). 3) They decide how to change or create the picture. 4) They output an edited or brand-new image. Why it matters: Without a single model that can do many tasks, users would have to switch tools all the time and learn many interfaces. 🍞 Anchor: Imagine one art robot that can draw from scratch or fix details on your photo—no swapping robots needed.

🍞 Hook: Picture telling someone, "Change the color of THAT bag," while tapping the photo. Tapping shows location better than a long sentence.

🥬 The Concept (Image editing techniques—inpainting/mask-based editing): What it is: A common way to edit images is to paint a mask over the area to change, and the AI redraws that part. How it works: 1) You cover an area with a mask. 2) You give an instruction (like “make it purple”). 3) The AI regenerates just that area. 4) The rest of the image is kept the same. Why it matters: Without masks, the model might change the wrong spot or too much area. 🍞 Anchor: It’s like putting painter’s tape on a wall so only the taped section gets a new color.
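
To make the painter's-tape analogy concrete, here is a minimal sketch of generic mask-based compositing in NumPy. It illustrates the general idea, not DreamOmni3's code; `regenerate_region` is a hypothetical stand-in for whatever generative model redraws the masked area.

```python
import numpy as np

def mask_based_edit(image: np.ndarray, mask: np.ndarray, regenerate_region) -> np.ndarray:
    """Classic mask-based editing: redraw only the masked pixels.

    image : (H, W, 3) uint8 source photo
    mask  : (H, W) bool array, True where the user wants the change
    regenerate_region : any callable returning a full (H, W, 3) edited image
                        (a stand-in for the generative model)
    """
    edited = regenerate_region(image)   # the model's proposal for the whole frame
    out = image.copy()
    out[mask] = edited[mask]            # keep every pixel outside the "painter's tape"
    return out

# Toy usage: turn the masked square purple.
img = np.zeros((64, 64, 3), dtype=np.uint8)
m = np.zeros((64, 64), dtype=bool)
m[20:40, 20:40] = True
purple = np.zeros_like(img)
purple[:] = (128, 0, 128)
result = mask_based_edit(img, m, lambda im: purple)
```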

🍞 Hook: Try describing “the man on the left behind the tree next to another man” with words only—tricky! Drawing a quick circle is faster.

🥬 The Problem: What it is: Text-only instructions struggle to pin down exact locations, uncommon object names, or which of many similar things you mean. How it works: 1) Text mentions what to change. 2) But it can’t point like your finger can. 3) Confusion grows with multiple similar objects or fine-grained details. Why it matters: Without a way to point, edits can land on the wrong spot or miss the user’s idea. 🍞 Anchor: Saying “Change the small green leaf near the bottom-left vine” is harder than circling that leaf.

🍞 Hook: Think of a coloring book where you can both write notes and draw arrows to show exactly what you want.

🥬 The Gap (What was missing): What it is: A unified way to combine text, example images, and freehand scribbles to control image edits and generation. How it works: 1) Let users draw simple shapes or doodles on images or blank canvases. 2) Let them also use text and reference images. 3) Let the model understand all of these together. Why it matters: Without scribbles, users can’t easily indicate location or shape; without unifying inputs, things get complicated. 🍞 Anchor: You can say “add a red car here,” show a photo of the exact car you like, and draw a circle where it should go.

🍞 Hook: If you’ve ever drawn arrows and circles on homework to show what to fix, you already know the power of scribbles.

🥬 The Concept (Binary masks vs. scribbles): What it is: Binary masks are black-and-white stencils; scribbles are colorful, quick markings users draw. How it works: Masks strictly carve out areas but get messy with many regions; scribbles use colors and shapes to mark multiple places more naturally. Why it matters: With many edits, masks are clunky; scribbles scale better and are easier to describe. 🍞 Anchor: Instead of cutting out five different paper stencils (masks), you can just draw five colored circles (scribbles) with a marker.
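
To see why colored scribbles scale better than a stack of binary stencils, here is a small sketch that recovers one region per instruction from a single scribble layer by color matching. The palette, colors, and tolerance are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical palette: each instruction gets its own scribble color.
PALETTE = {
    "make it purple": (255, 0, 0),       # red circle
    "add a red car here": (0, 255, 0),   # green circle
}

def regions_from_scribbles(scribble_layer: np.ndarray, tol: int = 30) -> dict:
    """Recover one boolean region per instruction from a single colored scribble layer.

    scribble_layer : (H, W, 3) uint8 image containing only the user's strokes.
    Returns {instruction: (H, W) bool mask} -- the scribble equivalent of cutting
    a separate binary stencil for every edit.
    """
    regions = {}
    for text, color in PALETTE.items():
        dist = np.abs(scribble_layer.astype(int) - np.array(color)).sum(axis=-1)
        regions[text] = dist < tol
    return regions

# Toy usage: a red doodle marks the "make it purple" region.
layer = np.zeros((64, 64, 3), dtype=np.uint8)
layer[10:20, 10:20] = (255, 0, 0)
masks = regions_from_scribbles(layer)
```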

🍞 Hook: Imagine giving the AI a before-and-after pair: the original photo and your marked-up copy with circles and arrows.

🥬 The Transition (What DreamOmni3 adds): What it is: DreamOmni3 brings scribble-based editing and generation into a unified model that also understands text and images. How it works: 1) It reads your original image and your scribbled version. 2) It aligns them carefully. 3) It performs precise edits or generates new images guided by scribbles and words. Why it matters: Without this alignment and unified design, edits would be less accurate and harder to control. 🍞 Anchor: You show the model “this is where to change” directly on the image, and it fixes exactly that spot.

02Core Idea

🍞 Hook: You know how teachers say, “Show your work”? Scribbles are you showing the AI exactly what you mean on the picture itself.

🥬 The Aha! Moment: One-sentence insight: Let the model read both the original image and a scribbled copy together, and teach it to align them with the same position and index codes so it can edit exactly where you drew—no messy masks needed.

Multiple analogies:

  1. Map and marker: The original image is the map; your scribbles are colored markers pointing to spots; the model uses matching coordinates so it doesn’t get lost.
  2. Transparent overlays: Think of laying a clear plastic sheet with your circles over the original photo; now the two layers line up perfectly.
  3. Recipe and sticky notes: The text is the recipe; your scribbles are sticky notes showing exactly which bowl or spot to use—no confusion.

Before vs. After:

  • Before: Text-only instructions often miss fine locations; masks are hard to manage for many areas and multiple references.
  • After: Scribbles + text + reference images in one system; colored scribbles make multi-region edits easy; alignment keeps non-edited pixels intact.

Why it works (intuition):

  • The model simultaneously sees untouched pixels (truth of the scene) and your hints (scribbles). Using the same position and index encoding is like giving both images the same grid labels, so the AI knows, “This red circle on the scribbled image corresponds to these exact pixels on the original.” That means it edits just the right place and keeps the rest steady.

Building Blocks (each explained with the sandwich pattern):

  1. 🍞 Hook: Imagine circling the backpack you want purple. 🥬 Scribble-based editing: What it is: Editing an existing image by drawing rough marks to show where and how to change it. How it works: 1) Draw shapes or doodles on the image. 2) Add a text instruction (and maybe a reference image). 3) The model changes that region. Why it matters: Without this, the model guesses location from words alone and can miss. 🍞 Anchor: You circle the bag and say, “Make it purple,” and it turns just that bag purple.

  2. 🍞 Hook: Picture a blank page where your doodle tells the AI what to create. 🥬 Scribble-based generation: What it is: Making a full image from a blank canvas using your doodles to place objects and shapes. How it works: 1) Draw simple outlines on white. 2) Add text and/or image references. 3) The model paints a full scene guided by your doodle. Why it matters: Without a doodle, it’s hard to control layout precisely with text alone. 🍞 Anchor: A moonlit garden scene appears where your moon and tree doodles were sketched.

  3. 🍞 Hook: Think of handing the AI two photos: the plain one and the one you scribbled on. 🥬 Joint input scheme: What it is: The model reads both the original image and the scribbled image together. How it works: 1) Feed both images in. 2) Keep them aligned with matching encodings. 3) Edit only where scribbled, keep the rest untouched. Why it matters: Without joint input, the scribble might hide pixels, and the AI can’t preserve the original details. 🍞 Anchor: The circle covers part of a face, but because the model also sees the clean original, it restores the face perfectly except for your requested change. (A code sketch after this list shows how this clean-plus-scribbled pair can be assembled.)

  4. 🍞 Hook: Gridlines on a map help you point to a square like “B-7.” 🥬 Position and index encoding: What it is: Special labels that tell the AI where pixels live and which image they belong to. How it works: 1) Give both images matching position codes. 2) Give them matching index codes so the model knows they’re a pair. 3) Keep reference images distinct by shifting codes. Why it matters: Without these codes, the AI could misalign spots, editing the wrong place. 🍞 Anchor: The red circle at (row 50, column 200) is aligned across both images, so the model edits exactly there.

  5. 🍞 Hook: A chef makes new dishes by remixing ingredients on hand. 🥬 Data synthesis pipeline: What it is: A process to build lots of training examples using text, images, and scribbles. How it works: 1) Start from a big dataset. 2) Mark regions (with a tool like Refseg). 3) Add circles/boxes/doodles or paste objects. 4) Generate targets that show correct edits. Why it matters: Without enough varied training data, the model won’t learn reliable scribble skills. 🍞 Anchor: The model practices on thousands of “circle here, change color to this” examples until it understands your scribbles.
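
The sketch below shows one way the joint input from building blocks 3 and 4 could be assembled: clean and scribbled source tokens share identical position and index codes, while a reference image gets a shifted index. The patch grid, the shape of the codes, and the token arrays are assumptions for illustration; the paper's MM-DiT backbone handles this internally.

```python
import numpy as np

def position_ids(h_patches: int, w_patches: int) -> np.ndarray:
    """One (row, col) label per patch -- the shared 'grid labels'."""
    rows, cols = np.meshgrid(np.arange(h_patches), np.arange(w_patches), indexing="ij")
    return np.stack([rows, cols], axis=-1).reshape(-1, 2)

def build_joint_input(clean_tokens, scribbled_tokens, reference_tokens, h_p, w_p):
    """Assemble one token sequence: clean + scribbled source share the SAME
    position and index codes; the reference image gets a shifted index.

    Each *_tokens argument is an (N, D) array of patch embeddings; producing
    them (e.g. with a VAE or patch embedder) is outside this sketch, and the
    reference is assumed to use the same patch grid for simplicity.
    """
    pos = position_ids(h_p, w_p)
    tokens, pos_codes, idx_codes = [], [], []

    # Source pair: identical positions, identical index 0.
    for toks in (clean_tokens, scribbled_tokens):
        tokens.append(toks)
        pos_codes.append(pos)
        idx_codes.append(np.zeros(len(toks), dtype=int))

    # Reference: same grid here for simplicity, but index 1 keeps it distinct.
    tokens.append(reference_tokens)
    pos_codes.append(pos)
    idx_codes.append(np.ones(len(reference_tokens), dtype=int))

    return np.concatenate(tokens), np.concatenate(pos_codes), np.concatenate(idx_codes)

# Toy usage with random "patch embeddings" on a 4x4 grid.
n, d = 16, 8
seq, pos, idx = build_joint_input(
    np.random.rand(n, d), np.random.rand(n, d), np.random.rand(n, d), 4, 4
)
```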

03Methodology

At a high level: Input (text + images + scribbles) → Encode everything (with joint input for editing) → Align positions and indexes → MM-DiT-based editing/generation → Output image.

Step-by-step with the sandwich pattern for key parts:

  1. 🍞 Hook: Like giving directions with both words and arrows on a map. 🥬 Define tasks (what): DreamOmni3 supports seven tasks—four for editing (scribble+text, scribble+multimodal, image fusion, doodle editing) and three for generation (scribble+text, scribble+multimodal, doodle generation). How it works: Each task mixes text, reference images, and scribbles differently to guide the model. Why it matters: Different creative needs call for different combinations; defining them lets us train and test properly. 🍞 Anchor: “Make the coat here match the sweater in photo 2,” with a red circle marking the coat.

  2. 🍞 Hook: Think of a sticker book where you cut, color, and paste to practice many skills. 🥬 Data synthesis (how): What it is: A recipe to create large, realistic training pairs. How it works: 1) Use Refseg to find object regions in source/target/reference images. 2) Paste handcrafted circles/boxes of varied shapes as scribbles. 3) For image fusion, crop objects from references and paste into sources. 4) For doodle editing/generation, convert objects into sketches (using a doodle generator) and place them back or on a blank canvas. Why it matters: Without rich, labeled examples that mirror real use, the model won’t generalize. 🍞 Anchor: The dataset includes examples like “add a man in the green circle,” “copy the toy from image 2 to this spot,” and “make the jacket’s colors match this reference.”

  3. 🍞 Hook: Sometimes your marker covers part of the photo—so you keep an untouched copy next to it. 🥬 Joint input for editing (key trick): What it is: Feed both the clean source image and the scribbled source image to the model. How it works: 1) Provide two streams of pixels. 2) Use the same position and index encodings so the model knows they align. 3) The model edits based on scribbles while preserving untouched details from the clean image. Why it matters: Without this, scribbles might hide details and cause inaccurate edits. 🍞 Anchor: You drew a thick circle over the hat; the model still knows the hat’s original look from the clean input and edits it correctly.

  4. 🍞 Hook: Labeling two sheets with the same grid makes lining them up easy. 🥬 Same position and index encoding (alignment glue): What it is: Matching codes attached to both the clean and scribbled images. How it works: 1) Position encodings give each pixel a location tag. 2) Index encodings mark which image is which and keep pairs synced. 3) Reference images use shifted encodings to stay separate. Why it matters: Without these, the model can mix up pixels or confuse references with sources. 🍞 Anchor: The circle on the scribbled image maps to the exact same spot on the clean image, enabling precise local edits.

  5. 🍞 Hook: You can point with scribbles on references too, but you don’t need to preserve every pixel there. 🥬 Input choices (when joint input is used): What it is: Use joint input only when editing the main source image; don’t use it for references or generation from scratch. How it works: 1) Editing the source: use joint input. 2) Reference images with scribbles: pass them once (no joint needed). 3) Generation on blank canvas: pass the scribbled canvas only. Why it matters: Joint input is powerful but adds compute; use it where it helps (editing) and skip it where it doesn’t (generation). 🍞 Anchor: For a blank-canvas doodle scene, there’s no original to preserve—so single input is enough.

  6. 🍞 Hook: Like plugging an add-on lens to a camera to gain a new ability without replacing the camera. 🥬 Training with LoRA (light add-on learning): What it is: A way to teach new skills by training a small add-on (LoRA) instead of changing the whole model. How it works: 1) Start with a strong unified model (Kontext + Qwen2.5-VL backbone). 2) Train separate LoRAs for generation and editing using the new data. 3) Activate the LoRA when scribble inputs are present. Why it matters: Without LoRA, you might lose old skills or need huge compute; with LoRA, you keep old abilities and add new ones efficiently. 🍞 Anchor: It’s like snapping a new lens onto your camera only when you need a close-up shot. (A small code sketch after this list illustrates the LoRA add-on.)

  7. 🍞 Hook: Grading art class with clear rubrics helps compare fairly. 🥬 Benchmark and evaluation (fair testing): What it is: A new test set and rules for judging success. How it works: 1) Real images and seven tasks. 2) Success if edits follow instructions, look consistent, avoid artifacts, and match scribble regions. 3) Checked by humans and Vision-Language Models (VLMs). Why it matters: Without a good benchmark, we can’t tell if the method truly works. 🍞 Anchor: The model gets a high score only if it edits the circled spot, matches the style or object asked for, and keeps the rest clean.
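
As referenced in step 6, here is a rough sketch of the LoRA add-on idea using the `peft` library on a toy stand-in backbone. The library choice, rank, and target modules are placeholders, not the paper's configuration.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model  # peft is an assumed tooling choice

# A tiny stand-in for the frozen unified backbone; the real model is a large
# Kontext-style MM-DiT + Qwen2.5-VL system, which we do not load here.
backbone = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

lora_cfg = LoraConfig(
    r=8,                        # adapter rank -- placeholder, not the paper's value
    lora_alpha=8,
    target_modules=["0", "2"],  # the two Linear layers of the toy backbone
    lora_dropout=0.0,
)

# Base weights stay frozen; only the small low-rank adapter matrices train.
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()
# At inference, the adapter would be activated only when scribble inputs are present.
```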

Concrete data examples:

  • Editing dataset: ~32K multimodal scribble edits, ~14K scribble+text edits, ~16K image fusion cases, ~8K doodle edits.
  • Generation dataset: ~29K multimodal scribble generations, ~10K scribble+text generations, ~8K doodle generations.
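
A minimal sketch of the circle-pasting step from the data synthesis recipe (step 2 above), using Pillow. The box coordinates, jitter, and colors are made up for illustration; region detection and target generation are outside this sketch.

```python
import random
from PIL import Image, ImageDraw

def add_scribble(src: Image.Image, box: tuple, color=(255, 0, 0)) -> Image.Image:
    """Paste a rough ellipse or rectangle 'scribble' around an object region.

    box : (left, top, right, bottom) region, e.g. produced by a referring-
          segmentation tool; here it is just a given rectangle.
    Returns a scribbled copy; the untouched `src` stays as the paired clean input.
    """
    scribbled = src.copy()
    draw = ImageDraw.Draw(scribbled)
    jitter = [c + random.randint(-8, 8) for c in box]        # rough, hand-drawn look
    shape = random.choice([draw.ellipse, draw.rectangle])
    shape(jitter, outline=color, width=6)
    return scribbled

# Toy training triple: clean image, scribbled image, and an instruction.
clean = Image.new("RGB", (256, 256), "white")
scribbled = add_scribble(clean, (60, 60, 180, 180))
example = {
    "source": clean,
    "scribbled": scribbled,
    "instruction": "make the object in the red circle purple",
}
```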

Secret sauce:

  • The combination of joint input for editing plus same position/index encoding is the clever core—together they let the model understand your drawn intent while preserving untouched details with pixel-accurate alignment.

04Experiments & Results

🍞 Hook: Think of a sports league where many teams play the same games, and we keep score with fair referees.

🥬 The Test (what and why): The team built a new DreamOmni3 benchmark with real images covering seven scribble tasks. They judged success by: 1) following instructions, 2) keeping people/objects/attributes consistent, 3) avoiding visual glitches, and 4) matching the scribbled regions. They used both human reviewers and VLMs (Gemini and Doubao) as referees. Why it matters: Without a clear scoreboard, it’s hard to know who really wins at scribble editing and generation. 🍞 Anchor: A case only “passes” if the model edits exactly where you circled and makes the requested change cleanly.
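
One plausible way to turn those four pass/fail checks into a single benchmark score is to count a case as a success only when every criterion holds, then report the success rate. The code below is a hedged sketch of that aggregation, not the paper's exact protocol; the criterion names simply mirror the description above.

```python
from statistics import mean

CRITERIA = (
    "follows_instruction",
    "keeps_consistency",       # people/objects/attributes stay consistent
    "no_artifacts",
    "matches_scribble_region",
)

def case_passes(judgement: dict) -> bool:
    """A benchmark case counts as a success only if every criterion holds.
    The judgements could come from a human rater or a VLM judge."""
    return all(judgement[c] for c in CRITERIA)

def benchmark_score(judgements: list) -> float:
    """Fraction of cases that pass, e.g. 23 successes out of 40 cases -> 0.575."""
    return mean(case_passes(j) for j in judgements)

# Toy usage with two judged cases.
score = benchmark_score([
    dict.fromkeys(CRITERIA, True),
    {**dict.fromkeys(CRITERIA, True), "no_artifacts": False},
])  # -> 0.5
```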

🥬 The Competition (who): Open-source and commercial baselines: Omnigen2, Qwen-image-Edit-2509, Kontext (with multi-image workaround), DreamOmni2, plus GPT-4o and Nano Banana. 🍞 Anchor: Imagine a race between several strong runners, including last year’s champions.

🥬 The Scoreboard (editing):

  • On the scribble-based editing benchmark, DreamOmni3 scored about 0.5250 (Gemini), 0.4500 (Doubao), and 0.5750 (human), outperforming all open-source models and matching or beating commercial models in human judgment. Context: That’s like getting an A when others get C+ to B-, with the top commercial runner nearby. 🍞 Anchor: Viewers preferred DreamOmni3’s edits more often, especially for accuracy and cleanliness.

🥬 The Scoreboard (generation):

  • For scribble-based generation, DreamOmni3 scored about 0.5116 (Gemini), 0.4651 (Doubao), and 0.5349 (human), better than open-source baselines and close to GPT-4o. Nano Banana often left scribble marks or looked collage-like. 🍞 Anchor: DreamOmni3 turned doodles into scenes cleanly, while others sometimes left the marker lines in the final picture.

🥬 Surprising findings:

  • Some commercial models weren’t tuned for scribbles, sometimes copying scribbles into the output. Also, GPT-4o sometimes added yellowing or changed pixels outside edited areas—something VLMs didn’t always penalize strongly but humans noticed. 🍞 Anchor: It’s like a fancy camera that adds a strange tint—you spot it with your eyes even if an automatic checker doesn’t.

Ablation insights (what made the difference):

  • Joint input matters for editing: Training on the custom dataset alone helped a lot, but adding joint input (clean + scribbled image) boosted editing success further (e.g., to ~0.45 in one set of scores), because it preserves original details hidden by the scribbles. For generation (from blank), joint input helped less, since there’s no original to preserve.
  • Same position + index encoding works best: Using both together gave the highest scores (e.g., ~0.45 editing, ~0.465 generation in their reported setup), confirming that pixel-perfect alignment between clean and scribbled inputs is key.

Bottom line:

  • DreamOmni3 set a new bar for scribble-based control, especially for precise editing, and performed competitively with leading commercial systems on generation—while clearly outperforming open-source baselines.

05Discussion & Limitations

🍞 Hook: Even great tools have places where they’re not the best fit—like using a paintbrush to write tiny letters.

🥬 Limitations (be specific): What it is: Situations where performance can drop. How it works: 1) Messy or overly thick scribbles can hide too much detail, making precise edits tougher (though joint input helps). 2) Extremely fine-grained edits (tiny jewelry, micro-text) may still challenge alignment. 3) The model depends on training data variety; rare objects/styles might be harder. 4) Automated VLM scoring may miss subtle artifacts humans catch. Why it matters: Knowing limits helps you decide when to rely on human review or extra refinement. 🍞 Anchor: A giant marker circle over someone’s eye may hide key pixels—joint input helps, but very tiny details might still be tricky.

🥬 Required resources: What it is: What you need to run or train it. How it works: 1) A unified editing/generation backbone (e.g., Kontext + VLM). 2) LoRA training (their runs took ~400 A100 hours). 3) Tools like Refseg for region detection. 4) Storage for the multi-task datasets. Why it matters: Without these, reproducing results or extending the system is hard. 🍞 Anchor: It’s like needing good brushes, paint, and a big canvas to recreate a mural.

🥬 When not to use: What it is: Cases where another method might be better. How it works: 1) If you need pixel-perfect restoration without any generative change, classical restoration/inpainting tools might be safer. 2) For medical or safety-critical images where any hallucination is unacceptable, stick to conservative, verified pipelines. 3) If text alone is enough and locations are unambiguous, simple text-editing may be faster. Why it matters: The right tool for the right job saves time and reduces risk. 🍞 Anchor: Don’t use a freestyle paintbrush for tracing a microchip—it’s the wrong tool.

🥬 Open questions: What it is: Things we still want to learn. How it works: 1) How to further reduce dependence on segmentation tools like Refseg. 2) How to make evaluation less reliant on VLMs and more robust to subtle artifacts. 3) How to scale to video (temporal scribbles) while keeping consistency frame-to-frame. 4) How to combine 3D or depth hints with scribbles for even better spatial control. Why it matters: Solving these brings us closer to fully intuitive, trustworthy creative AI. 🍞 Anchor: Next steps are like adding animation to your drawings—keeping everything aligned as it moves.

06Conclusion & Future Work

🍞 Hook: You know how circling a spot in a photo makes your request crystal clear? That simple idea powers this whole paper.

🥬 Three-sentence summary: DreamOmni3 teaches an AI to understand text, reference images, and your scribbles together, so it can edit or generate images exactly where you point. The key is feeding both the original and the scribbled image (joint input) and giving them the same position and index codes so pixels line up. A new dataset and benchmark show this works better than previous open-source systems and rivals commercial models, especially for precise editing.

Main achievement: Replacing complex mask workflows with a unified scribble-first approach—powered by joint input and matching encodings—that greatly improves controllability and keeps non-edited pixels intact.

Future directions: Expand to video with temporal scribbles; add depth/3D cues; improve evaluation beyond VLMs; reduce reliance on segmentation tools; grow datasets for rare objects and styles.

Why remember this: It turns the most natural human gesture—pointing—into a reliable instruction for AI art, making creative control simple, fast, and accurate for everyone.

Practical Applications

  • Photo retouching: Circle blemishes or objects to fix, while preserving everything else.
  • Product design: Sketch changes on prototypes (color, material, logo placement) and see instant variations.
  • Interior planning: Doodle where to place furniture or art and generate realistic room previews.
  • Education: Students draw arrows/circles on diagrams, and the AI edits or generates examples accordingly.
  • Storyboarding: Roughly sketch scenes and characters to generate polished frames for films or comics.
  • Marketing creatives: Point to exact regions for brand color swaps and layout tweaks with fewer rounds.
  • AR/VR prototyping: Draw simple spatial hints for object placement and style, then render scenes.
  • Fashion styling: Circle garments and apply reference styles or colors from a catalog image.
  • Game asset creation: Doodle object silhouettes and fuse parts from references to generate new assets.
  • Personalization: Insert objects from one photo into another (e.g., “put this toy here”) with clean blending.
Tags: scribble-based editing · scribble-based generation · joint input scheme · position encoding · index encoding · image fusion · doodle editing · multimodal instruction · unified generation and editing · LoRA · benchmarking with VLMs · DreamOmni3 · instruction-based editing · pixel alignment · inpainting alternative