Alterbute: Editing Intrinsic Attributes of Objects in Images
Key Summary
- Alterbute is a diffusion-based method that changes an object's intrinsic attributes (color, texture, material, shape) in a photo while keeping the object's identity and the scene intact.
- It introduces Visual Named Entities (VNEs) like “Porsche 911 Carrera” or “IKEA LACK table” to define identity at a sweet spot between too-broad categories and too-rigid instances.
- Instead of demanding rare data where only intrinsic attributes change, Alterbute relaxes training to learn from pairs where both intrinsic and extrinsic factors may change, then locks extrinsic factors at inference.
- A special input grid places a noisy target on the left and an identity reference on the right so the model can copy identity details but only denoise the left side.
- At inference, Alterbute reuses the original background and an object mask so only the requested intrinsic attribute changes.
- VNE labels and attribute descriptions are automatically extracted from OpenImages using a vision-language model, producing large-scale identity-consistent supervision.
- Across user studies and evaluations against seven strong baselines, Alterbute’s edits were preferred most of the time (often above 85%).
- It supports precise edits (color/material/texture) and even shape changes, though shape edits on rigid objects can still be tricky.
- Bounding-box masks enable reshaping but can add minor background artifacts; precise masks give cleaner results.
- Alterbute runs as a single unified model for all intrinsic attributes, avoiding per-attribute tools or per-object tuning.
Why This Research Matters
This approach lets people reliably show the same product in many finishes or colors without reshooting photos, saving cost and time. Designers can try new materials and textures quickly while keeping the product’s recognizable identity. Shoppers can preview real product variants in the same room lighting, improving confidence and reducing returns. AR and virtual staging become more believable because the identity and context stay stable. Brands maintain consistency while exploring creative variations. Educators and storytellers can adjust looks (like costumes or props) without changing character identity. Overall, it brings fine-grained, identity-safe control to visual editing, making digital creation more useful and trustworthy.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re customizing a toy car in a photo. You want it red instead of blue, maybe with a leather-like finish, but you still want it to look like that same exact model of car, in the same driveway, at the same angle. That sounds simple, but it’s surprisingly hard for computers.
🥬 The World Before: For years, image editing AIs got good at changing easy, outside-the-object stuff—like backgrounds, lighting, and style—because those changes don’t usually mess with what the object is. But when you asked them to change what the object is made of (material), how it’s patterned (texture), its exact color, or even its shape, they often broke the object’s identity. Turning “this Porsche 911 Carrera” into “some car that looks kind of similar” isn’t good enough when you care about the exact model.
🥬 The Problem: The real challenge is editing intrinsic attributes—color, texture, material, and shape—without losing the object’s identity or the scene’s context. If you change too much, it stops looking like the same product model. If you change too little, the edit isn’t useful.
🥬 Failed Attempts: Two main families struggled here. Unsupervised, prompt-only diffusion editors relied on vague model priors, so identity often drifted: they preserved only coarse categories like “car,” not exact product lines. On the other side, subject-driven/personalization methods locked identity down so tightly (instance-level) that you couldn’t meaningfully change intrinsic attributes at all; asking to change paint color or material would either confuse the model or be blocked outright.
🥬 The Gap: What’s missing is an identity definition that’s specific enough to feel like “the same product line” but loose enough to allow color/material/texture/shape edits. Also missing is a way to train with real image pairs without requiring almost-impossible datasets where the scene stays exactly the same but only intrinsic attributes change.
🥬 The Stakes: This matters in everyday life. Think e-commerce (show the same chair in oak, walnut, or metal), product design (try new finishes fast), real estate staging (change countertop material), AR try-ons (different shoe colorways), and visual storytelling (consistent characters with wardrobe/texture changes). If AI can do precise, identity-safe edits, people save time, reduce photoshoots, and explore ideas quickly.
🍞 Anchor: Think of editing a sneaker photo from white leather to black suede, keeping it unmistakably the same Nike model in the same room. Before: tools either changed too much (identity drift) or refused to change enough (locked identity). After: Alterbute does the exact change you asked for while keeping everything else—identity and background—steady.
02 Core Idea
🍞 Hook: You know how a school uniform still signals the same school even if the shirt color varies by season? That’s the trick: allow certain parts to change while keeping the defining parts constant.
🥬 The Aha Moment (one sentence): Train the model on easier-to-find examples where both intrinsic and extrinsic factors may change, but at inference time freeze the scene and location so only intrinsic attributes change—using Visual Named Entities (VNEs) to define identity at just the right level.
🥬 Multiple Analogies:
- Recipe swap: Keep the same cake recipe (identity) but swap frosting color or sprinkles (intrinsic) while serving on the same plate at the same table (extrinsic fixed at inference).
- Team jersey: The team identity (VNE) stays the same even if you change jersey color or fabric; the stadium and camera angle (extrinsic) are fixed on game day so only the jersey attribute changes.
- Lego model: It’s the same Lego set model (identity), but you can replace certain bricks (color/material) while placing it in the same spot on the shelf (extrinsic fixed), so it’s clearly the same set.
🥬 Before vs After:
- Before: Editors either drifted identity (too loose) or blocked intrinsic changes (too strict). Data for intrinsic-only edits were nearly impossible to find.
- After: By relaxing training to accept both intrinsic and extrinsic changes, the model can actually learn from abundant real pairs. Then at inference, reusing the original background and object mask locks extrinsic factors so only the requested intrinsic attribute updates. VNEs define identity precisely enough to feel like the same product line (e.g., “Porsche 911 Carrera”), not just any car.
🥬 Why It Works (intuition, no equations):
- Data availability: It’s easy to find pairs where many things changed; it’s hard to find pairs where only intrinsic attributes changed. Train on the easy-to-find pairs.
- Inference constraint: When it’s time to edit your image, you reuse your own background and mask—this acts like a seatbelt, preventing unwanted scene or pose changes.
- Identity sweet spot: VNEs group visually consistent products within a manufacturing line, letting intrinsic attributes vary naturally without breaking identity.
- Information routing: A two-panel grid (noisy target on the left, identity reference on the right) lets attention move identity details from right to left while the loss focuses only on the left: copy identity, edit attributes.
🥬 Building Blocks (each as a mini sandwich):
- 🍞 You know how some names refer to a specific model line, like “iPhone 16 Pro”? 🥬 Concept: Visual Named Entities (VNEs) are fine-grained product-line names that define identity while allowing intrinsic variation. Why it matters: Without VNEs, identity is either too broad (drifts) or too strict (no edits). 🍞 Example: “IKEA LACK table” across different colors.
- 🍞 Imagine a worksheet split into two columns. 🥬 Concept: Input grid—left is the noisy target to edit; right is the clean identity reference. The model only learns to fix the left side but can peek at the right to keep identity consistent. Why it matters: Without this, identity cues don’t transfer cleanly. 🍞 Example: Car to be recolored on the left; the same model car as reference on the right.
- 🍞 Think of using painter’s tape. 🥬 Concept: Masks and background reuse lock the scene so only the inside gets painted (edited). Why it matters: Without masking and background reuse, the model might shift the scene or pose. 🍞 Example: Keep the living room unchanged while you change the sofa fabric.
- 🍞 Picture a coach sometimes whispering and sometimes silent. 🥬 Concept: Classifier-free guidance balances listening to the text prompt versus image cues. Why it matters: Without it, edits can be weak or identity can overpower the prompt. 🍞 Example: Ensures “make it wood” shows real wood grain but still looks like the same chair.
🍞 Anchor: Editing a “Porsche 911 Carrera” from silver metal to cherry-red paint with a carbon-fiber texture hint, in the same driveway and angle. VNE keeps the model line; the grid passes identity; the mask and background freeze the scene; guidance makes the new material show up clearly.
03 Methodology
High-level overview: Input image → Build conditions (identity reference, prompt, background, mask) → Grid the inputs (noisy target left, reference right) → Diffusion denoising guided by text and images → Output edited image with only the specified intrinsic attribute changed.
Step-by-step (like a recipe):
- Prepare inputs
- What happens: From the input photo, you segment the object to get a binary mask. You also crop the object and remove its background to make an identity reference (just the object). The original image with the object region gray-masked becomes the background input. You write a very short text prompt for the attribute to change, like “color: red” or “material: wood”. (A minimal code sketch of this preparation follows the step.)
- Why it exists: The mask and background lock extrinsic context; the identity crop centers the model on what to preserve; the text tells exactly which intrinsic attribute to alter. Without this, the scene might shift and identity might drift.
- Example: Photo of a blue chair. Mask isolates the chair. Reference is the chair cut out. Background is the room with a gray hole where the chair was. Prompt: “material: leather”.
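To make this preparation concrete, here is a minimal Python sketch of the three image-side conditions. It assumes an object mask already exists (from any off-the-shelf segmentation model); the file names, the white fill for the identity reference, and the gray value 128 are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from PIL import Image

def prepare_conditions(image_path: str, mask_path: str, attribute_prompt: str) -> dict:
    """Build the image-side conditions described above (hypothetical helper).

    The mask is assumed to come from any off-the-shelf segmentation model,
    stored as a white-on-black PNG of the object.
    """
    image = np.array(Image.open(image_path).convert("RGB"))
    mask = np.array(Image.open(mask_path).convert("L")) > 127  # boolean object mask

    # Identity reference: the object cut out, background replaced with a neutral fill.
    identity_ref = np.full_like(image, 255)
    identity_ref[mask] = image[mask]

    # Background input: the original photo with the object region filled with gray.
    background = image.copy()
    background[mask] = 128

    return {
        "mask": mask.astype(np.float32),
        "identity_reference": identity_ref,
        "background": background,
        "prompt": attribute_prompt,  # e.g. "material: leather"
    }

conditions = prepare_conditions("chair.jpg", "chair_mask.png", "material: leather")
```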
- Build the training grid
- What happens: During training, inputs are placed in a 1Ă—2 grid at 512Ă—1024. Left half: the noisy latent of the target. Right half: a reference object from the same VNE cluster. The text prompt is embedded by a text encoder; the mask and gray-masked background are concatenated as extra channels on the left half. The loss is applied only on the left. (The grid assembly is sketched after this step.)
- Why it exists: The grid lets attention move identity details from right to left; computing loss only on the left forces learning to reconstruct the target with the desired attribute. Without the grid, identity transfer is weaker; without left-only loss, the model might waste effort on the reference.
- Example: Left shows a noisy motorbike to be made “color: matte black.” Right shows a clean crop of the same VNE motorbike model.
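A rough PyTorch sketch of how such a 1Ă—2 grid and a left-only loss mask could be assembled in latent space. The tensor shapes (SDXL-style 8Ă— downsampling, so 64Ă—128 latents for a 512Ă—1024 grid) and the exact way the mask and background channels are concatenated are assumptions for illustration.

```python
import torch

def build_training_grid(target_latent, reference_latent, mask_latent, background_latent):
    """Assemble the 1x2 training grid in latent space.

    Assumed shapes (SDXL-style 8x downsampling of 512x512 halves):
      target_latent, reference_latent, background_latent: (B, 4, 64, 64)
      mask_latent:                                         (B, 1, 64, 64)
    """
    # Left half = noisy target, right half = clean identity reference.
    grid = torch.cat([target_latent, reference_latent], dim=-1)      # (B, 4, 64, 128)

    # Mask and gray-background conditions live on the left half; zero-pad the right.
    cond = torch.cat([mask_latent, background_latent], dim=1)        # (B, 5, 64, 64)
    cond = torch.cat([cond, torch.zeros_like(cond)], dim=-1)         # (B, 5, 64, 128)

    # Loss mask: 1 over the left (target) half, 0 over the right (reference) half.
    loss_mask = torch.zeros(grid.shape[0], 1, grid.shape[2], grid.shape[3])
    loss_mask[..., : grid.shape[3] // 2] = 1.0

    model_input = torch.cat([grid, cond], dim=1)                     # (B, 9, 64, 128)
    return model_input, loss_mask
```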
- Relaxed training objective
- What happens: The model is trained on pairs where both intrinsic and extrinsic attributes may differ. It learns to produce a coherent object with the specified attributes, placed into the provided background at the mask location. (The left-only loss is sketched after this step.)
- Why it exists: Real-world pairs where only intrinsic attributes changed are rare. Relaxing the objective makes training feasible at scale. Without this, you couldn’t supervise the task well.
- Example: Two cups from the same product line but in different kitchens and slightly different lighting; the model learns identity from the product line, not from exact scene matches.
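One plausible form of the left-only training loss, written as standard noise-prediction regression restricted to the target half; the exact weighting and prediction target in the paper may differ.

```python
import torch
import torch.nn.functional as F

def left_only_diffusion_loss(pred_noise, true_noise, loss_mask):
    """Noise-prediction loss restricted to the target (left) half of the grid."""
    per_pixel = F.mse_loss(pred_noise, true_noise, reduction="none")   # (B, C, H, W)
    masked = per_pixel * loss_mask                                     # zero out the reference half
    return masked.sum() / loss_mask.expand_as(per_pixel).sum().clamp(min=1.0)
```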
- Identity via VNE clusters
- What happens: A vision-language model (VLM) scans a large dataset (OpenImages), assigns fine-grained product-line labels (VNEs) to object crops, and extracts intrinsic attribute descriptions (color/texture/material/shape). Images sharing a VNE form a cluster from which identity references are sampled. (A small clustering sketch follows this step.)
- Why it exists: VNEs strike the right balance—same product line identity with natural intrinsic variety—exactly what the model must learn. Without VNEs, you either get identity drift (too broad) or no edit freedom (too narrow).
- Example: A cluster for “IKEA LACK table” containing variants in black, white, and birch finishes across many rooms.
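A small sketch of how VNE-labeled crops might be grouped and sampled during training. The record fields (`crop_path`, `vne`, `attributes`) are hypothetical names, not the paper's data schema.

```python
import random
from collections import defaultdict

def build_vne_clusters(records):
    """Group VLM-labeled object crops by their VNE label.

    `records` is a list of dicts such as
      {"crop_path": "...", "vne": "IKEA LACK table", "attributes": {"color": "black"}}
    (illustrative field names, not the paper's schema).
    """
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["vne"]].append(rec)
    # Keep clusters with at least two members so a target/reference pair exists.
    return {vne: items for vne, items in clusters.items() if len(items) >= 2}

def sample_pair(clusters):
    """Sample a (target, identity-reference) pair from the same VNE cluster."""
    vne = random.choice(list(clusters))
    target, reference = random.sample(clusters[vne], 2)
    return target, reference
```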
- Diffusion denoising with multimodal conditioning
- What happens: The UNet denoiser (based on SDXL) takes the noisy left half plus text, mask, and background. Cross-attention layers fuse the text prompt. Self-attention flows across the grid, letting identity cues move from the right reference to the left target while the loss supervises only the left. (A guidance-step sketch follows this step.)
- Why it exists: Attention is the pathway for identity details; text is the switch for the specific intrinsic change; mask/background are the rails keeping the scene on track. Without these, edits might be off-target or identity might shift.
- Example: Prompt “texture: herringbone fabric” causes the model to synthesize a plausible fabric weave over the same sofa silhouette.
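A simplified sketch of one classifier-free-guidance denoising step over the grid. Here `unet` stands in for the SDXL-style denoiser as a plain callable; the real model takes additional conditioning arguments, and the guidance scale of 5.0 is just a placeholder.

```python
import torch

@torch.no_grad()
def guided_denoise_step(unet, noisy_grid, timestep, text_emb, null_emb, guidance_scale=5.0):
    """One classifier-free-guidance step over the 1x2 grid.

    `unet` is treated as a plain callable (latents, t, encoder_hidden_states) -> noise;
    the real SDXL denoiser takes additional conditioning arguments.
    """
    # Conditional pass: the prompt names the intrinsic change, e.g. "texture: herringbone fabric".
    eps_cond = unet(noisy_grid, timestep, encoder_hidden_states=text_emb)
    # Unconditional pass: empty/null prompt embedding, same image-side conditions.
    eps_uncond = unet(noisy_grid, timestep, encoder_hidden_states=null_emb)
    # Push the prediction toward the prompt while the reference half keeps supplying identity cues.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```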
- Inference-time constraints
- What happens: To edit your image, you use your own background and your own object mask, plus your object crop as the identity reference, and give a short attribute-only prompt. The model updates only the specified intrinsic attribute while keeping other intrinsic features and all extrinsic context fixed. (The full inference recipe is sketched after this step.)
- Why it exists: This is how the model becomes a surgical editor instead of a scene rewriter. Without reusing your background/mask, it might change the camera angle or add new stuff.
- Example: “shape: wider brim” for a hat; with a coarse bounding-box mask, the model can slightly reshape the brim while keeping the person and room intact.
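A workflow-level sketch of the inference recipe, assuming a generic `sampler` callable that wraps the trained model; the function and argument names are illustrative, not a released API.

```python
import numpy as np

def edit_intrinsic_attribute(image, object_mask, prompt, sampler):
    """Inference-time recipe: reuse the user's own scene so only the object changes.

    image:       the user's photo as an (H, W, 3) uint8 array
    object_mask: boolean (H, W) mask of the object in that photo
    prompt:      short attribute-only prompt, e.g. "color: red"
    sampler:     any callable that runs the trained diffusion model on these conditions
    """
    # Identity reference and background both come from the same photo.
    identity_ref = np.where(object_mask[..., None], image, 255)
    background = np.where(object_mask[..., None], 128, image)

    # The model synthesizes the object with the requested intrinsic attribute.
    edited = sampler(prompt=prompt, reference=identity_ref,
                     background=background, mask=object_mask)

    # Composite: keep original pixels outside the mask so the scene stays untouched.
    return np.where(object_mask[..., None], edited, image)
```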
- Mask strategies for shape edits
- What happens: Training alternates between precise segmentation masks and coarse bounding boxes, so the model learns both detailed attribute changes and shape flexibility. (A mask-mixing sketch follows this step.)
- Why it exists: Precise masks are great for color/material/texture; coarse boxes are needed for reshaping when the final silhouette isn’t known ahead of time. Without both, shape edits would suffer or precise edits would leak.
- Example: Using a bounding box around a lamp to change it to a taller cylinder profile while holding the same base identity.
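A minimal sketch of mixing precise and coarse masks during training; the 50/50 ratio is an assumption, since the text only says both mask types are used.

```python
import random
import numpy as np

def training_mask(segmentation_mask: np.ndarray, bbox_prob: float = 0.5) -> np.ndarray:
    """Randomly swap the precise segmentation mask for its bounding box.

    The 50/50 mixing ratio is an assumption, not a value from the paper.
    """
    ys, xs = np.nonzero(segmentation_mask)
    if ys.size == 0 or random.random() >= bbox_prob:
        return segmentation_mask          # precise mask: tight color/material/texture edits
    bbox_mask = np.zeros_like(segmentation_mask)
    bbox_mask[ys.min(): ys.max() + 1, xs.min(): xs.max() + 1] = 1   # coarse box: room to reshape
    return bbox_mask
```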
The secret sauce:
- The identity sweet spot (VNEs) + relaxed training for data scale.
- The left-right grid that lets attention share identity but confines learning to the target half.
- The inference-time reuse of background/mask to lock extrinsic factors, turning a broad training skill into a precise editing tool.
Concrete mini “Sandwich” blocks for core components:
- 🍞 You know how you tape edges before painting a wall? 🥬 Masks and background reuse keep the scene safe so only the object’s inside changes. Why it matters: Without tape, paint (the edit) spills onto the scene. 🍞 Example: Keep kitchen tiles unchanged while turning a kettle from steel to matte white.
- 🍞 Imagine a buddy picture where one person shows the hairstyle you want. 🥬 Grid identity reference gives the style guide; loss on the target side ensures only the target changes. Why it matters: Without the buddy image, the haircut might not match the right style family. 🍞 Example: Keeping the same bike model fender shape while changing its finish to carbon fiber.
04 Experiments & Results
🍞 Hook: Think of a taste test where people pick which cupcake best matches the flavor they asked for, but it must still be the same bakery’s signature cupcake.
🥬 The Test: The team built an evaluation set of 30 objects (about half from classic personalization benchmarks like DreamBooth and half from more diverse categories such as furniture and vehicles) and wrote 100 prompts asking for specific intrinsic changes (color, material, texture, or shape). They measured two things at once: does the edit match the requested attribute, and does the object still look like the same identity?
🥬 The Competition: Alterbute was compared with seven strong editors: FlowEdit, InstructPix2Pix, OmniGen, UltraEdit, Diptych (general-purpose), plus MaterialFusion (material transfer) and MimicBrush (texture transfer). These cover both broad editors and specialists.
🥬 The Scoreboard (with context): In head-to-head comparisons, human evaluators on CloudResearch preferred Alterbute’s results most of the time against every baseline, often above 85% preference—a bit like scoring an A+ while many others get B’s. Vision-language models (Gemini, GPT-4o, Claude) agreed with humans, showing strong alignment. On conventional metrics (CLIP/DINO for identity similarity and CLIP image–text similarity for prompt fit), Alterbute remained competitive and achieved the best text-alignment score, though the paper notes these single-number metrics can be misleading for this task if used alone.
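For readers curious what the CLIP image-text alignment score mentioned above looks like in practice, here is a generic sketch using the Hugging Face `transformers` CLIP model; the checkpoint and file name are illustrative, and the paper's exact metric configuration is not specified here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_alignment(image_path: str, prompt: str) -> float:
    """Cosine similarity between an edited image and the requested attribute text."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

score = clip_text_alignment("edited_sofa.png", "a navy-blue velvet sofa")
```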
🥬 Surprising/Interesting Findings:
- VNEs matter a lot. When identity references came from DINOv2 or instance-retrieval features (instead of VNE clusters), edits weakened or identity drifted. VNE clusters reliably offered identity-consistent yet attribute-diverse examples.
- Shape edits worked best with bounding-box masks but could introduce small background artifacts—acceptable for flexibility, but a trade-off.
- The data was heavy-tailed: many VNE clusters are small, but a few are huge (e.g., lots of cars), showing how the method scales to popular product lines while still covering diverse items.
🍞 Anchor: If users asked “make the leather sofa a navy-blue velvet,” Alterbute’s outputs were chosen far more often than other tools—meaning people felt it matched “navy-blue velvet” well and still looked like the same sofa model in the same living room.
05 Discussion & Limitations
🍞 Hook: Imagine a super-precise art tool that sometimes smudges if your tape line is too wide. It’s powerful, but you should know how and when to use it.
🥬 Limitations:
- Bounding-box masks can cause small background glitches inside the box because the model must imagine the new shape without an exact silhouette to hug.
- Shape edits on rigid or identity-tight objects are hard; sometimes the geometry becomes unrealistic or bumps into identity-defining parts.
- Intrinsic attributes can conflict (e.g., “material: gold” with “color: black”), so the model prioritizes realistic combos; incompatible requests may not fully apply.
🥬 Required Resources:
- A strong diffusion backbone (SDXL-scale UNet) and substantial compute (the paper reports about a day of training on many TPUs), plus a segmentation model for masks.
- Access to large datasets like OpenImages and a capable VLM to auto-label VNEs and extract attribute text.
🥬 When NOT to Use:
- If you need exact CAD-precise geometry changes for engineering; this is a visual editor, not a mechanical modeler.
- If the object identity is unknown, unbranded, or too generic to form a VNE; the tool relies on a recognizable product-line identity.
- If you require pixel-perfect background preservation while doing large shape edits with coarse masks; use precise masks or pre-remove the object.
🥬 Open Questions:
- How to further stabilize rigid-object shape edits without hurting identity?
- Can we automatically refine masks or remove/recover backgrounds for shape edits to avoid artifacts?
- How to expand VNE coverage to niche products and non-mass-produced items while keeping identity meaningful?
- Can multi-attribute edits be planned to avoid subtle conflicts (e.g., with a small planner that checks material–color coherence)?
🍞 Anchor: Think of it like photo makeovers for products—fantastic for recolors and materials, good but not perfect at reshaping, and best when the product line (VNE) is known and consistent.
06 Conclusion & Future Work
🍞 Hook: Picture a wardrobe app that can turn your same favorite jacket into leather, denim, or corduroy in your selfie—still your jacket, same pose, same room.
🥬 Three-sentence summary: Alterbute edits an object’s intrinsic attributes—color, texture, material, and shape—while preserving identity and scene context. It trains with a relaxed objective on easier-to-find examples where both intrinsic and extrinsic factors may vary, then locks extrinsic factors at inference. Visual Named Entities (VNEs) provide an identity definition that is specific yet flexible, enabling reliable, identity-preserving edits.
🥬 Main achievement: A single unified diffusion model that can surgically edit any intrinsic attribute with strong identity preservation, unlocked by VNE-based identity conditioning and an input-grid design that transfers identity while confining edits.
🥬 Future directions: Sharper shape-control (especially for rigid objects), automatic background recovery for bounding-box reshapes, richer multi-attribute planning for compatibility, and broader VNE coverage across more product categories.
🥬 Why remember this: Alterbute shows that you don’t need impossible training data to learn precise edits—you can relax training, then constrain inference, and choose the right granularity of identity (VNE) to get human-satisfying, product-line-consistent results.
Practical Applications
- E-commerce: Show the same furniture model in multiple woods and fabrics in the same staged room.
- Automotive retail: Preview car model trims in different paints and interior materials while preserving the exact model identity.
- Product design: Rapidly explore material/texture options for a product line without new photoshoots.
- Real estate and staging: Change countertop materials, cabinet finishes, or flooring while keeping the same kitchen scene.
- AR try-on: Swap sneaker colorways and materials while keeping the same shoe model on the user’s foot.
- Marketing: Maintain brand identity while creating campaign variants (color/material changes) across consistent scenes.
- Catalog localization: Adapt product visuals to regional preferences (e.g., lighter woods) without changing the product line.
- Film/game props: Adjust textures and materials of the same in-world item to fit scenes, preserving prop identity.
- Repair/restoration previews: Visualize how an original product would look with a new finish or replacement part.
- Education/demos: Teach intrinsic vs. extrinsic attributes by interactively editing examples without identity drift.