
Adapting Vision-Language Models for E-commerce Understanding at Scale

Beginner
Matteo Nulli, Vladimir Orshulevich, Tala Bazazo et al. · 2/12/2026
arXiv

Key Summary

  • This paper shows a simple, repeatable way to teach general Vision-Language Models (VLMs) to understand e-commerce items much better without forgetting their general skills.
  • The team builds a big, carefully checked training set from 15 million product listings using a Visual Verification Pipeline that keeps only image-grounded facts.
  • They train in stages (align images and text, mid-stage practice, then visual instruction tuning) and add 4 million e-commerce-focused instructions.
  • They introduce four new tests: Aspect Prediction, Deep Fashion Understanding, Dynamic Attribute Extraction, and Multi-image Item Intelligence.
  • Across many models and sizes, the adapted models win strongly on shopping tasks and stay competitive on regular vision-language benchmarks.
  • Fine-tuning for Multi-image Item Intelligence makes models both more accurate and up to 3.8x faster at inference compared to a larger zero-shot model.
  • Targeted image crops (bounding boxes) and better labels significantly boost accuracy for compliance-style attribute extraction.
  • Bigger text decoders and e-commerce-savvy language models often help; the best vision encoder depends on the task and image resolution.
  • Even though training used single-image data, the adapted models generalize surprisingly well to multi-image understanding.
  • Limitations include English-only data, platform-specific biases, some reliance on LLM-generated labels, and memory limits with very long image sequences.

Why This Research Matters

This work helps shopping sites truly understand what’s in product photos, so search and filters become more accurate and helpful. Sellers can list items faster because the AI can auto-fill attributes from images with fewer mistakes. Safety and compliance checks improve because tiny but important labels (warnings, ingredients, age ranges) are more reliably found and recorded. Customer service can answer visual questions (like 'Is this V-neck?') confidently and consistently. The approach stays general enough that the same models can still perform well on regular image–text tasks outside of shopping. Faster, smaller fine-tuned models reduce costs and latency, making these benefits practical at large scale. Altogether, this builds trust for buyers and sellers and reduces returns and confusion.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine trying to buy a pair of sneakers online. You read the title, scan the photos, and check details like brand, color, size, and materials. Your brain mixes words and pictures to decide, “Is this the one?”

🥬 The Concept (Vision-Language Models): A Vision-Language Model (VLM) is an AI that looks at images and reads text together so it can answer questions, follow instructions, and describe what it sees. How it works: (1) It turns a picture into a set of visual clues, (2) it reads any text you give it (like a title), (3) it blends the clues and the text, and (4) it writes an answer. Why it matters: Without VLMs, online shopping tools would ignore pictures or ignore text, missing half the story.

🍞 Anchor: When you ask, “What brand is this shoe?” a VLM can look at the logo in the photo and the title, and reply “Nike.”

The World Before: Online marketplaces mostly matched text to text—for example, the words in your search to the words in product titles. But buying choices depend a lot on pictures: front, back, labels, tags, and box shots. General-purpose VLMs already existed and were good at captioning or basic Q&A about images. Still, they weren’t trained for the exact, picky details that matter for shopping—like “Is the neckline a V-neck or crew neck?” or “Which warning label is printed on the back?”

🍞 Hook (Data Curation): You know how a librarian organizes books so you can find what you need fast? Good AI training data needs the same care.

🥬 The Concept (Data Curation): Data curation means cleaning and organizing raw listings so the model learns only true, helpful signals. How it works: (1) Collect listings and images, (2) make image captions, (3) check which product attributes truly match what’s visible, (4) keep only verified pairs. Why it matters: If training data has errors (like calling a blue shirt “red”), the model will learn bad habits.

🍞 Anchor: If the photo shows a round neckline, data curation makes sure the training label says “crew neck,” not “V-neck.”

The Problem: E-commerce has special challenges: (a) attribute-centric reasoning (many small, exact details), (b) multi-image aggregation (several photos tell one story), and (c) noisy seller content (typos, missing info, or wrong claims). Also, there wasn’t a clear, repeatable recipe to adapt VLMs to these challenges without making them worse at general tasks.

Failed Attempts: Text-only models could do shopping chat but missed visual facts. General VLMs could describe images but weren’t strict about e-commerce attributes (they might invent attributes or ignore important visual clues). Some works added images to text datasets, but they didn’t deeply focus on image-grounded attributes or multi-image reasoning the way marketplaces need.

🍞 Hook (E-commerce Benchmarks): Think of a report card that tests exactly the skills a student needs for a class.

🥬 The Concept (E-commerce Benchmarks): Benchmarks are tests that measure how well a model understands shopping pictures and text. How it works: (1) Build tasks like attribute prediction or category-specific checks, (2) set clear scoring rules, (3) compare models fairly. Why it matters: Without the right tests, you can’t tell if a model is ready for real shoppers.

🍞 Anchor: A benchmark might ask, “From these sneaker photos, identify the brand and model.”

The Gap: We needed (1) a backbone-agnostic recipe that works with different vision encoders and language models, (2) a massive, verified e-commerce instruction dataset to teach models attribute discipline, and (3) a broad evaluation suite covering single-image details, multi-image reasoning, and strict instruction following.

Real Stakes: Better product understanding means easier listing for sellers (auto-fill attributes from photos), better search (find the exact style), safer shopping (catch missing warnings or ingredients), and faster customer support (clear, structured product facts). This saves time, reduces returns, and builds trust.

🍞 Hook (Multi-image Aggregation): Imagine making a jigsaw puzzle—you need all the pieces to see the whole picture.

🥬 The Concept (Multi-image Aggregation): Multi-image aggregation means combining clues from several photos of the same item. How it works: (1) Look at each image, (2) pick useful regions (like labels), (3) merge facts across photos, (4) produce a single, clean answer. Why it matters: One photo rarely shows every detail; important safety or brand info may be on the back or side.

🍞 Anchor: For a toy, the front photo shows the brand, but the back photo shows the age warning; the model must use both.

02Core Idea

🍞 Hook: You know how a good coach doesn’t change the entire team, but adds the right drills to win the next game? That’s the idea here.

🥬 The Concept (Aha! Moment): The key insight is: with a careful data pipeline and targeted instruction tuning on image-grounded e-commerce tasks, general VLMs can be adapted to excel at shopping details without losing their general skills. How it works: (1) Verify which attributes are truly visible in images, (2) build millions of focused instructions, (3) train in stages to align vision and language, (4) fine-tune specific multi-image tasks with better labels and smart crops. Why it matters: Randomly adding shopping data can teach shortcuts or hallucinations; verified, image-grounded training teaches precision.

🍞 Anchor: After adaptation, when asked “List all visible attributes of this handbag,” the model outputs brand, material, color, and visible logos that truly appear in the photos—no guesses.

Three Analogies:

  • Glasses Upgrade: The base VLM can see, but not crisply. The verified, attribute-focused data acts like new glasses that sharpen tiny details (logos, seams, tags).
  • Grocery Scanner: Instead of reading the whole store shelf, the model learns to scan barcodes (key regions) and record exact facts.
  • Orchestra Rehearsal: The vision encoder (strings) and text decoder (wind instruments) already know music; the new rehearsal pieces (benchmarks and instructions) make them play tightly together for the e-commerce concert.

Before vs After:

  • Before: Models were good at general image talk but sloppy with attribute names, multi-image consistency, and strict formats.
  • After: Models extract precise, image-verified attributes, follow tight instructions (e.g., output JSON), and combine clues across multiple photos, while still doing well on general VLM tests.

🍞 Hook (Visual Verification Pipeline): Imagine a fact-checker who looks at the picture and says, “Yes, that attribute is truly visible.”

🥬 The Concept (Visual Verification Pipeline): It’s a data-building process that pairs images with only those attributes that the image actually shows. How it works: (1) Caption each image with a strong VLM, (2) compare the caption with seller-listed attributes, (3) use a language model to keep only attributes supported by the caption (and thus by the image), (4) create clean instructions for training. Why it matters: It prevents training on mismatches like calling a square watch “round.”

🍞 Anchor: If the caption says “leather strap” and the listing says “silicone strap,” the pipeline drops or corrects that attribute.

🍞 Hook (Instruction Tuning): You know how practicing the exact plays you’ll use in a game makes you better at game time?

🥬 The Concept (Instruction Tuning): Instruction tuning means practicing with prompts and answers that match real e-commerce tasks. How it works: (1) Write prompts like “Extract all visible attributes,” (2) require strict outputs (like JSON), (3) include yes/no and free-form Q&A, (4) vary context (with/without title, OCR text, or category). Why it matters: Without practice on the real playbook, the model won’t reliably follow rules or formats.

🍞 Anchor: The model is told, “List only attributes visible in the image, not guesses,” and it learns to do just that.
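The prompt-and-answer practice described above can be sketched as a small sample builder. The templates and field names below are illustrative stand-ins, not the paper's actual schema; they only show the idea of mixing task types, strict answer formats, and optional title context.

```python
import json
import random

# Hypothetical instruction templates; the paper's exact prompts are not public here.
TEMPLATES = {
    "extract": "List only attributes visible in the image as compact JSON.",
    "yes_no": "Answer yes or no: is the {attribute} {value}?",
    "open": "What is the {attribute} of this item?",
}

def build_sample(task, verified_attrs, title=None, rng=random):
    """Turn one verified (image, attributes) pair into an instruction sample."""
    attr, value = rng.choice(sorted(verified_attrs.items()))
    prompt = TEMPLATES[task].format(attribute=attr, value=value)
    if title:  # optionally add listing context, as in the with-title variants
        prompt = f"Title: {title}\n{prompt}"
    if task == "extract":
        answer = json.dumps(verified_attrs, separators=(",", ":"))
    elif task == "yes_no":
        answer = "yes"  # the queried value came from the verified attributes
    else:
        answer = value
    return {"prompt": prompt, "answer": answer}

sample = build_sample("extract", {"brand": "Sony", "color": "black"},
                      title="Sony WH-1000XM5 headphones")
print(sample["answer"])  # {"brand":"Sony","color":"black"}
```

Varying `task` and the context fields across millions of verified pairs is what gives the model practice at both strict formats and free-form answers.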

Building Blocks:

  • Verified single-image instructions (about 4M) covering VQA, Dynamic Attribute Extraction, precise instruction following, and listing generation.
  • Staged training: (a) align vision and language, (b) mid-stage practice, (c) visual instruction tuning.
  • Task-specific fine-tuning for multi-image item intelligence with better labels and smart cropping.
  • Broad evaluation across both internal and public benchmarks to ensure no loss of general ability.

Why It Works (Intuition): The model learns to trust the pixels first. By filtering labels to what’s visible, it stops picking up habits from noisy metadata. Instruction variety teaches it to switch skills: detect small logos, read printed text, obey output formats, and merge facts across photos. Because we keep general-domain data in the mix, the model doesn’t forget how to do non-shopping tasks.

🍞 Hook (Dynamic Attribute Extraction): Think of playing “I spy,” but you write down every true thing you see.

🥬 The Concept (Dynamic Attribute Extraction): This teaches the model to discover and serialize all visible attributes without a fixed list. How it works: (1) Look at the image, (2) decide which attributes matter, (3) write them as key–value pairs, (4) avoid anything not visible. Why it matters: Real products vary; rigid schemas miss important facts.

🍞 Anchor: For a DVD cover, the model might output {"title": "Movie Name", "format": "Blu-ray", "rating": "PG-13", "studio": "XYZ"} only if those appear on the cover.
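Because Dynamic Attribute Extraction requires strict, flat JSON, a downstream system typically validates the model's output before using it. This is a minimal sketch of such a validator, assuming (as an illustration, not from the paper) that only flat string key-value pairs are accepted.

```python
import json

def parse_dae_output(raw):
    """Parse a model's dynamic-attribute JSON output.

    Returns None on malformed or non-dict output so callers can count
    format failures separately from content errors; keeps only flat
    string key/value pairs, the shape DAE expects.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {k: v for k, v in data.items()
            if isinstance(k, str) and isinstance(v, str)}

print(parse_dae_output('{"title": "Movie Name", "format": "Blu-ray"}'))
# {'title': 'Movie Name', 'format': 'Blu-ray'}
```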

🍞 Hook (Aspect Prediction): When shopping, you often ask, “What’s the sleeve length?” because it changes the fit.

🥬 The Concept (Aspect Prediction): Aspect prediction is choosing the right value for a specific attribute (like brand or neckline). How it works: (1) Read the prompt and optional title/category, (2) scan the image, (3) pick the best class. Why it matters: Search, filters, and recommendations depend on correct aspects.

🍞 Anchor: Given a men’s shirt photo, the model answers “Short sleeve” instead of “Long sleeve.”

03Methodology

At a high level: Input (product images + optional title/category) → Verified instruction data building → Three-stage training (align → mid-stage → instruction tuning) → Optional multi-image fine-tuning with better labels and crops → Output (precise answers or structured JSON).

Step 1: Build High-Quality, Image-Grounded Data

  • What happens: Collect ~15M listings. For each listing’s main image, generate a rich caption (InternVL-2.5-26B). Compare these captions with seller attributes. Use a language model (Mistral-Small-3-24B) to keep only attributes that the caption supports (so they’re likely visible). Create training instructions from these verified pairs.
  • Why this exists: It removes mismatches and hallucinations in labels. Without it, models learn to trust noisy text more than pixels.
  • Example: Title says “silk scarf,” but the captioning sees knit texture. The verifier rejects “silk,” keeping only attributes clearly supported by the image.
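The filtering step above can be sketched in a few lines. The paper uses an LLM (Mistral-Small-3-24B) to judge whether a caption supports each attribute; the substring check below is a toy stand-in for that judgment, shown only to make the shape of the filter concrete.

```python
def verify_attributes(caption, listed_attrs):
    """Keep only attributes whose value is supported by the image caption.

    Toy stand-in: a real pipeline asks an LLM whether the caption entails
    the attribute; plain substring matching would miss paraphrases.
    """
    caption_lc = caption.lower()
    return {k: v for k, v in listed_attrs.items() if v.lower() in caption_lc}

caption = "A wristwatch with a brown leather strap and a round silver dial."
listed = {"strap": "leather", "dial shape": "round", "band": "silicone"}
print(verify_attributes(caption, listed))
# keeps leather and round; drops the unsupported silicone claim
```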

🍞 Hook (Visual Verification Pipeline): Like a referee checking instant replay to confirm the call.

🥬 The Concept: The Visual Verification Pipeline is a multi-step checker ensuring that training pairs are truly image-grounded. How it works: (1) Caption image, (2) cross-check attributes, (3) keep only visually supported ones, (4) turn them into instructions. Why it matters: Prevents teaching the model to parrot noisy seller text.

🍞 Anchor: If the photo shows “8GB RAM” printed on the box, the pipeline keeps that attribute; if not, it’s discarded.

Step 2: Three-Stage Training

  • Stage A: Vision-Language Alignment
    • What happens: Use standard alignment data (e.g., BLIP-LAION 558k) so the model’s vision and text parts speak the same “language.”
    • Why: Without alignment, the model can’t connect visual patterns to words reliably.
    • Example: The model learns that swoosh-like shapes often map to “Nike.”
  • Stage B: Mid-Stage Training
    • What happens: Train on a curated mixture of visual tasks to strengthen general multimodal skills while removing low-signal subsets.
    • Why: Builds robust perception (OCR, charts, diagrams) to avoid overfitting to e-commerce too early.
    • Example: Reading text on street signs improves label-reading on packaging later.
  • Stage C: Visual Instruction Tuning (Single-Image)
    • What happens: Practice on ~4M internally built e-commerce instructions plus a LLaVA-OneVision mixture. Tasks include VQA (yes/no and open-ended), Dynamic Attribute Extraction, Precise Instruction Following (format control, keyword constraints), and Listing generation. Variants add OCR text, title/category context, and length constraints.
    • Why: This makes the model follow rules, extract exact attributes, and stay faithful to pixels.
    • Example: “Only return attributes visible in the image; output compact JSON with keys and values.”

🍞 Hook (Instruction Tuning): It’s like drilling plays so the team can run them perfectly in the game.

🥬 The Concept: Instruction tuning teaches the model to follow real e-commerce prompts tightly. How it works: (1) Give task-specific prompts, (2) require strict outputs, (3) mix formats and difficulties. Why it matters: Without it, the model might ramble or ignore format rules.

🍞 Anchor: The model learns to return just {"brand": "Sony", "model": "WH-1000XM5"} instead of a long paragraph.

Step 3: Multi-image Item Intelligence Fine-Tuning (Optional, Production-Style)

  • What happens: For regulatory and compliance use-cases, curate 100k multi-image items (median 5 images). First, auto-annotate with a strong LLM (GPT-4.1). Then use Qwen2.5-VL-32B to draw bounding boxes around informative regions (logos, ingredient panels, safety labels). Re-annotate using cropped regions plus originals to get better labels. At inference, optionally include targeted crops.
  • Why: Important facts live in small regions. Without focusing on them, models miss tiny but critical details.
  • Example: A toy’s age warning is tiny text on the back; bounding boxes highlight it so the model can read it.

🍞 Hook (Bounding Boxes and Crops): Like using a magnifying glass on the important parts of a map.

🥬 The Concept: Bounding boxes and image crops focus the model on the most informative areas. How it works: (1) Propose boxes around text/logos, (2) expand/merge for full coverage, (3) crop and feed these along with originals. Why it matters: It boosts accuracy on tiny, crucial details while controlling compute.

🍞 Anchor: Cropping just the nutrition facts panel helps the model extract ingredients more accurately.
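The expand-and-merge step can be illustrated with plain box arithmetic. The margin value and the greedy single-pass merge below are illustrative choices, not the paper's exact procedure (which relies on Qwen2.5-VL-32B to propose the boxes in the first place).

```python
def expand(box, margin, width, height):
    """Grow an (x1, y1, x2, y2) box by a margin, clamped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width, x2 + margin), min(height, y2 + margin))

def overlaps(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def merge_boxes(boxes):
    """Union overlapping boxes so each informative region is cropped once.
    Greedy single pass; chained overlaps may need a second pass."""
    merged = []
    for box in boxes:
        for i, m in enumerate(merged):
            if overlaps(box, m):
                merged[i] = (min(box[0], m[0]), min(box[1], m[1]),
                             max(box[2], m[2]), max(box[3], m[3]))
                break
        else:
            merged.append(box)
    return merged

boxes = [expand(b, 10, 800, 600) for b in [(100, 100, 200, 150),
                                           (190, 120, 300, 180),
                                           (500, 400, 600, 450)]]
print(merge_boxes(boxes))  # [(90, 90, 310, 190), (490, 390, 610, 460)]
```

The merged boxes are then cropped and fed alongside the original images, so small text regions arrive at the model at a readable effective resolution.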

Architectures Compared

  • Vision encoders: SigLIP2 (great general features) and Qwen2.5 ViT (native high-res support). Choice depends on task and image size.
  • Text decoders (LLMs): Llama 3.1, e-Llama (e-commerce adapted), Lilium (trained for e-commerce), Qwen3, Gemma3. Larger, more knowledgeable LLMs often help, especially on general benchmarks.

🍞 Hook (Vision Encoder): Think of the camera on your phone—the lens quality changes what you can see.

🥬 The Concept: A vision encoder turns images into tokens the language model can understand. How it works: (1) Split image into patches, (2) encode features, (3) pass features to the text model. Why it matters: If the encoder misses small details, the language model can’t reason about them.

🍞 Anchor: A high-res encoder can notice a tiny CE mark on packaging; a lower-res one might miss it.

🍞 Hook (Text Decoder): Imagine the narrator in a documentary—clear, precise words matter.

🥬 The Concept: The text decoder writes the model’s final answers. How it works: (1) Read visual tokens, (2) combine with prompt and context, (3) generate the next word step-by-step. Why it matters: A stronger decoder follows instructions better and avoids format mistakes.

🍞 Anchor: It outputs clean JSON instead of a chatty paragraph when asked.

🍞 Hook (Fine-tuning): It’s like extra practice right before a recital.

🥬 The Concept: Fine-tuning is a short training phase on your exact task. How it works: (1) Gather task data, (2) train a bit more, (3) lock in the needed behavior. Why it matters: It sharpens performance dramatically without retraining from scratch.

🍞 Anchor: Fine-tuning on compliance data boosts accuracy for detecting warnings and ingredients.

🍞 Hook (LLM-as-a-judge): Think of a teacher grading an essay with a rubric.

🥬 The Concept: LLM-as-a-judge uses a separate language model to score answers for tricky tasks. How it works: (1) Provide the prompt, image description, and model output, (2) ask the judge to grade for correctness and format, (3) average scores. Why it matters: Some tasks aren’t easy to auto-grade with exact string match.

🍞 Anchor: For multi-image JSON outputs, the judge checks if each requested attribute is correctly filled and grounded.
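The grading loop can be sketched as prompt assembly plus score aggregation. The rubric wording below is invented for illustration (the paper's actual judge prompt is not reproduced here), and the LLM call itself is left out since it depends on the deployment.

```python
def judge_prompt(question, reference_caption, model_output):
    """Assemble the rubric prompt sent to the judge model.
    The rubric text here is a hypothetical example, not the paper's."""
    return (
        "You are grading a product-understanding answer.\n"
        f"Question: {question}\n"
        f"Image description: {reference_caption}\n"
        f"Model answer: {model_output}\n"
        "Score 0-1 for correctness and 0-1 for format, as 'corr,fmt'."
    )

def aggregate(scores):
    """Average the (correctness, format) pairs returned by the judge."""
    if not scores:
        return (0.0, 0.0)
    n = len(scores)
    return (sum(c for c, _ in scores) / n, sum(f for _, f in scores) / n)

print(aggregate([(1.0, 1.0), (0.5, 1.0), (1.0, 0.0)]))
```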

Efficiency Notes

  • Pan & Scan vs targeted crops: Both help, but targeted crops tailored to informative regions usually work better.
  • Token budget: Crops plus deduplication (e.g., perceptual hashing) keep the number of images manageable to avoid memory issues.
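Perceptual-hash deduplication can be sketched with the classic "average hash": downscale to a tiny grayscale grid, then set one bit per pixel above the mean. The grid size and Hamming threshold below are illustrative; real pipelines resize the image (e.g. to 8x8) before hashing, which is skipped here.

```python
def average_hash(pixels):
    """Average hash of a small grayscale grid: 1 bit per pixel,
    set when the pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a, b):
    return bin(a ^ b).count("1")

def dedupe(images, threshold=2):
    """Keep one image per near-duplicate cluster (small Hamming distance)."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, k) > threshold for k in hashes):
            kept.append(img)
            hashes.append(h)
    return kept

a = [[0, 0], [255, 255]]
b = [[0, 10], [250, 255]]   # near-duplicate of a
c = [[255, 255], [0, 0]]    # different layout
print(len(dedupe([a, b, c])))  # 2
```

Dropping near-duplicate shots before encoding keeps the visual-token count, and therefore memory, under control for items with many photos.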

04Experiments & Results

The Test: The team measures three things: (1) deep product understanding (like sleeve types, patterns, logos), (2) strict instruction following (especially JSON formats, keyword constraints), and (3) dynamic attribute extraction (discover what’s visible without a fixed schema). They also test multi-image Item Intelligence (compliance-style extraction). To make sure the models remain broadly capable, they evaluate on public general benchmarks such as MMBench, MME, MMMU, TextVQA, and eComMMMU.

The Competition: They compare many model mixes—different vision encoders (SigLIP2 vs Qwen2.5 ViT) and language decoders (Llama 3.1, e-Llama, Lilium, Qwen3, Gemma3) plus strong open VLMs (Qwen3-VL, Gemma3, LLaVA-OneVision). This tells us whether gains come from the vision side, language side, size, or the adaptation strategy itself.

The Scoreboard (with context):

  • On internal e-commerce tasks like Aspect Prediction and Deep Fashion Understanding, adapted models consistently outperform general open VLMs. For example, with SigLIP2 + Qwen-3-8B, Aspect Prediction (Fashion + Title & Category) reaches around the high 60s to near 80%, which is like moving from a solid B to an A-/A when others hover in the B range.
  • On Dynamic Attribute Extraction, adapted models substantially improve faithful, image-grounded outputs (around 66–71% on different mixes), which is crucial because this task punishes hallucinations.
  • On public general-domain benchmarks, top external VLMs like Qwen3-VL-8B often lead, but adapted models remain competitive, showing the adaptation does not wipe out broad skills.
  • On eComMMMU (multi-image, public e-commerce), the internal adapted models gain notable points over their non-adapted counterparts—even though training used primarily single-image instructions. That’s like practicing with one ball but still winning a multi-ball drill.

Vision Encoder Findings:

  • No single winner across all tasks. Qwen2.5 ViT can shine in higher-resolution settings (small details), while SigLIP2 is very strong in general. Since many benchmarks use low-to-mid resolutions, their results come out close.

Text Decoder Findings:

  • E-commerce knowledge helps: e-Llama and Lilium (e-commerce adapted) tend to do better on shopping tasks than plain Llama 3.1.
  • General capability helps too: Newer, stronger decoders like Qwen3 and Gemma3 often boost both general and shopping tasks, especially Aspect Prediction.
  • Size matters up to a point: Going from 1B to 4B often helps; 4B to 8B sometimes gives smaller gains depending on task complexity.

Multi-image Item Intelligence (Compliance-style) Results:

  • Zero-shot Gemma3-27B already benefits from using all images vs just the primary one. But fine-tuning changes the game: a fine-tuned Gemma3-4B becomes both faster (about 3.8x) and more accurate (F1 around 50.5) than the zero-shot 27B on this task. Fine-tuned Gemma3-27B improves further.
  • Better labels + crops: Using bounding boxes to guide labels and including targeted crops at inference boosts performance across sizes (e.g., Gemma3-27B F1 rising toward the high 50s). This shows it matters a lot where the model looks.
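F1 scores like the ones above can be computed over (attribute, value) pairs. This is a generic micro-averaged sketch; the paper's exact matching rules (e.g. how near-synonymous values are credited) may differ.

```python
def attribute_f1(predicted, gold):
    """Micro-averaged F1 over (attribute, value) pairs across items.
    A pair counts as correct only if both key and value match exactly;
    real evaluations may credit paraphrased values."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_pairs, ref_pairs = set(pred.items()), set(ref.items())
        tp += len(pred_pairs & ref_pairs)
        fp += len(pred_pairs - ref_pairs)
        fn += len(ref_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [{"brand": "Acme", "age warning": "3+"}]
gold = [{"brand": "Acme", "age warning": "36 months", "origin": "DE"}]
print(round(attribute_f1(pred, gold), 2))  # tp=1, fp=1, fn=2 -> 0.4
```

Because every hallucinated pair adds a false positive and every missed label a false negative, this metric rewards exactly the faithfulness the fine-tuning targets.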

Surprising Findings:

  • Single-image training generalizes to multi-image evaluation better than expected—adaptation teaches behaviors (precision, faithfulness, instruction following) that transfer.
  • Task-specific decoder knowledge and size can lift not only text-heavy benchmarks but also visually grounded shopping tasks.
  • Carefully pruning low-signal mid-stage data helps keep the model sharp and avoids overfitting to junk.

Concrete Example Anchors:

  • Dynamic Attribute Extraction: For a cosmetics box, the adapted model outputs {"brand": "BrandX", "shade": "Rosewood", "finish": "Matte"} only if those are seen on the packaging—no invented claims like “cruelty-free” unless a logo says so.
  • Multi-image Compliance: Across images of a toy, the model pulls brand from the front, age warning from the back, and country of origin from a side label, merging into one clean JSON.

05Discussion & Limitations

Limitations:

  • Monolingual scope: Everything is in English, so performance on other languages, scripts, and local sizing/currency formats remains unknown.
  • Platform dependence: Data and prompts come mainly from one marketplace, so habits tied to that site (like attribute names or photo styles) may not transfer perfectly elsewhere.
  • LLM-mediated supervision and evaluation: Parts of data labeling and judging use LLMs, which can introduce biases or over-alignment to similar model families.
  • Coverage gaps: Some categories (long tail), rare attributes, or unusual photo styles are underrepresented, and the DAE set is about 1k items.
  • Long image sequences: More than 10 images can cause memory or latency issues without token-efficient strategies or larger context models.

Required Resources:

  • Compute: Multi-stage training at scale (dozens to 100+ GPUs) for large instruction sets.
  • Data: Millions of listings and images plus captioning/verification steps.
  • Models: Access to strong captioners, verifiers, and vision encoders for bounding boxes.

When NOT to Use:

  • Cross-lingual, specialized locales where units, sizes, or scripts (e.g., JP/EU sizing, multi-script OCR) are critical and not represented in training.
  • Ultra-high-resolution tasks if the chosen vision encoder can’t handle native high-res or if memory budgets are tight.
  • Domains where hallucination-free legal or medical claims must be guaranteed without human oversight.

Open Questions:

  • How well does this transfer to other marketplaces with different ontologies and seller behaviors?
  • What’s the best way to scale to 10–40 images per item without OOM or lag (token-efficient pooling, visual memory, or hierarchical encoding)?
  • Can we make the verification pipeline multilingual and multi-script, including local units and currencies?
  • What is the optimal balance of general and domain data to maximize both broad skills and e-commerce precision?
  • Can smaller, efficient decoders match large-model accuracy with the right cropping and teacher signals?

06Conclusion & Future Work

Three-Sentence Summary: The paper delivers a clear, backbone-agnostic recipe to adapt general VLMs for e-commerce by building verified, image-grounded instruction data and training in focused stages. It introduces comprehensive benchmarks for aspects, deep fashion attributes, dynamic attribute extraction, and multi-image item intelligence, proving big in-domain gains without harming general abilities. Targeted fine-tuning with better labels and smart crops further boosts accuracy and speed for compliance-style tasks.

Main Achievement: A practical, scalable adaptation pipeline that turns general VLMs into precise e-commerce experts—faithful to the pixels, disciplined with attributes, and still broadly capable on public benchmarks.

Future Directions:

  • Multilingual and multi-script extension (OCR for diverse scripts, local units/sizes/currencies).
  • Token-efficient multi-image processing and longer visual contexts to avoid OOM and reduce latency.
  • Domain-adapting stronger decoders (e.g., Qwen3/Gemma3) to combine broad knowledge with e-commerce precision.
  • Richer, human-verified datasets for long-tail categories and rare attributes.
  • Automated quality control to reduce bias from LLM-as-a-judge.

Why Remember This: It’s a blueprint for teaching AI to truly “look before it speaks” in shopping—verifying attributes from pixels, handling many photos, and returning clean, useful facts. That means better search and filters, safer purchases, faster listings, and more trust between buyers and sellers—at internet scale.

Practical Applications

  • Auto-attribute extraction for new listings (brand, color, size, materials) directly from photos.
  • Deep fashion tagging (neckline, sleeve length, pattern) to power more accurate search filters.
  • Compliance scanning across multiple photos to capture warnings, ingredients, and certifications.
  • Catalog cleaning by verifying that listed attributes match what’s visible, reducing hallucinations.
  • Question-answer assistants that follow strict formats (e.g., JSON) for downstream systems.
  • OCR-driven detail capture on packaging (model numbers, barcodes, serials) for inventory and support.
  • Multi-image summarization of product facts for customer-facing pages and chatbots.
  • Faster, cheaper inference using fine-tuned smaller models for large-scale marketplace pipelines.
  • Cross-checking title/category claims against the pixels to flag potential mislistings.
  • Bulk enrichment of long-tail items by discovering dynamic, image-grounded attributes.
Tags: Vision-Language Models, E-commerce adaptation, Attribute extraction, Dynamic Attribute Extraction, Multi-image aggregation, Visual instruction tuning, Data curation, Bounding boxes, OCR, Compliance extraction, Benchmarking, Aspect prediction, Deep fashion understanding, Backbone-agnostic training, Fine-tuning