
Enhancing Multi-Image Understanding through Delimiter Token Scaling

Intermediate
Minyoung Lee, Yeji Park, Dongjun Hwang et al. · 2/2/2026
arXiv · PDF

Key Summary

  • Large Vision-Language Models (LVLMs) are great with one picture but get confused when you give them several, often mixing details from different images.
  • Models already use special divider tokens to mark where one image ends and the next begins, but those tokens don’t block the mix-ups well enough.
  • This paper’s simple idea is to turn up the “volume” of those divider tokens by scaling their hidden states, so the model pays them stronger attention.
  • Stronger dividers act like local magnets: they pull together information inside each image while pushing away distractions from other images.
  • This lowers cross-image leakage and keeps within-image reasoning strong, improving accuracy on many multi-image benchmarks.
  • The same trick also helps in text-only cases that need clear separation, like multi-document summaries and questions over multiple tables.
  • There is no extra training or inference cost, and the method works across different model sizes and families.
  • Compared to prior training-free methods like Focus, this approach is both faster and more memory-efficient while achieving better scores.
  • Ablations show it’s important to scale the true delimiter tokens (not just any token) and that early-layer scaling works best.
  • The method currently needs access to model internals (hidden states) and clear delimiters, so it’s easiest to use with open-source models.

Why This Research Matters

Many real tasks compare or combine information across multiple items: looking at two photos, summarizing several news articles, or answering questions from multiple tables. If the model mixes up details between items, it becomes unreliable and frustrating to use. This paper’s method strengthens the built-in separators so each item’s details stay in their lane, leading to clearer, more accurate answers. It’s practical: no retraining, no slowdown, and works across many models and tasks. That means teams can deploy it quickly to make their multi-item features better today. As models get larger and inputs get longer, simple boundary-keeping like this becomes even more important.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re sorting photos from different family trips. If you toss them all into one pile with no labels, you’ll soon start mixing the beach pictures with the mountain ones.

🥬 The Concept (Large Vision-Language Models, or LVLMs): LVLMs are AI systems that look at images and read text together to answer questions or describe scenes.

  • How it works: 1) Break the image into pieces the model can read, 2) Mix those with words, 3) Use attention to decide which bits matter most, 4) Produce an answer.
  • Why it matters: Without LVLMs, computers can either read or see, but not combine both well. Many real tasks (like asking questions about a chart) need both.

🍞 Anchor: When you ask a model, “Which animal in this picture is sleeping?” it uses both the words and the picture to find the answer.

🍞 Hook: You know how sticky notes between chapters help you keep stories separate in a big book?

🥬 The Concept (Delimiter Tokens): Delimiter tokens are special markers inside the model’s input that say, “A new image starts here!”

  • How it works: 1) The model inserts a start and end marker for each image, 2) It learns to treat everything between them as one “image block,” 3) When reading, it can tell which tokens belong to which image (see the layout sketch below).
  • Why it matters: Without delimiters, the model can’t tell where one image ends and the next begins, so it mixes details across images.

🍞 Anchor: If you show two photos (a dog in Image 1 and a cat in Image 2), delimiters are like clear labels that say “Dog section” and “Cat section.”
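
A minimal layout sketch, assuming hypothetical marker names (<img_start>, <img_end>) and toy token lists; each model family defines its own special delimiter tokens:

```python
# Sketch: interleave image token blocks with start/end delimiters.
# The delimiter names and toy "tokens" are placeholders, not any real vocabulary.

def build_multi_image_sequence(image_token_lists, question_tokens):
    """Lay out [<img_start> ... <img_end>] blocks per image, then the question."""
    sequence = []
    for img_tokens in image_token_lists:
        sequence.append("<img_start>")  # delimiter: a new image begins
        sequence.extend(img_tokens)     # visual tokens for this image
        sequence.append("<img_end>")    # delimiter: this image ends
    sequence.extend(question_tokens)    # the text question comes last
    return sequence

seq = build_multi_image_sequence(
    [["dog_patch_1", "dog_patch_2"], ["cat_patch_1", "cat_patch_2"]],
    ["Which", "image", "has", "the", "cat", "?"],
)
print(seq)
```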

🍞 Hook: Think about listening to a band. Your ears focus more on the singer’s voice during the chorus than on the background drum taps.

🥬 The Concept (Attention Mechanism): Attention is the model’s way to focus on the most useful parts of the input at each step.

  • How it works: 1) Look at all words and image pieces, 2) Score how helpful each piece is right now, 3) Give higher scores to helpful pieces, 4) Use those to decide what to say next (a toy version appears below).
  • Why it matters: Without attention, the model treats “sky” and “answer” as equally important and gets confused.

🍞 Anchor: If you ask, “What color is the car in the first photo?”, attention boosts “car” and “first photo” details and ignores unhelpful bits.
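
To make the scoring step concrete, here is a toy single-head scaled dot-product attention in PyTorch; shapes and random values are illustrative only, and real LVLMs use many heads and far larger dimensions:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Toy single-head scaled dot-product attention."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # how relevant is each token?
    weights = F.softmax(scores, dim=-1)          # turn scores into a focus budget
    return weights @ v                           # mix values according to focus

q = torch.randn(1, 4, 8)  # 4 query tokens, dimension 8
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
print(attention(q, k, v).shape)  # torch.Size([1, 4, 8])
```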

The world before: LVLMs did great on single images—naming objects, reading signs, and answering questions. But when you fed multiple images together (like comparing photos A and B), their performance dropped. Researchers noticed the model often blended details from different images, like saying the bicycle from Image 2 was in Image 1.

The problem: This blending is called cross-image information leakage. Even though models already insert delimiter tokens between images, the leakage still shows up in attention maps and in wrong answers.

Failed attempts:

  • Make special multi-image training data and fine-tune: Works, but it’s expensive to gather high-quality data and train big models.
  • Preprocess images to keep only relevant parts (e.g., AVAM): Reduces clutter, but adds extra modules and complexity.
  • Contrastive decoding (e.g., Focus): Helps separate outputs but needs many extra forward passes (slow and memory-hungry).
  • Temporal embeddings (M-RoPE): Inspired by video, but less effective on its own for static multi-image inputs, and not as simple.

The gap: We needed a simple, training-free, model-agnostic tweak that strengthens the separation between images without slowing inference or changing the architecture.

The stakes (why you should care):

  • Shopping: Compare two product photos without mixing features (wrong size from the other image is a costly mistake).
  • Education: Side-by-side diagrams or pages shouldn’t get swapped accidentally in answers.
  • Office work: Summarizing many documents or answering questions from multiple tables needs clean boundaries.
  • Medicine: When comparing two scans, mixing details could confuse a diagnosis.
  • Everyday use: When you ask, “Which picture shows the dog sitting?” the model should pick the correct image, not scramble them.

🍞 Anchor: Think of the model as a librarian sorting books (images). Before, the shelf labels (delimiters) were too faint, so books from different series got mixed. This paper makes the labels bold and bright so each book stays in its series.

02 Core Idea

🍞 Hook: Imagine you’re running a relay race with lanes painted on the track. If the lane lines are faint, runners drift and bump into each other.

🥬 The Concept (Cross-Image Information Leakage): Cross-image leakage is when the model confuses details from different images, like moving out of its lane.

  • How it works: 1) The model reads multiple images at once, 2) Attention sometimes spills from Image A to Image B, 3) The answer blends details (e.g., says the bike from B appears in A).
  • Why it matters: Without fixing this, multi-image reasoning is unreliable.

🍞 Anchor: When asked, “Which image has the man on a bicycle?”, leakage makes the model say “both,” even if only the second image has the bike.

🍞 Hook: You know how turning up the volume on a quiet voice makes it easier to hear in a noisy room?

🥬 The Concept (Hidden State Scaling of Delimiter Tokens): The key insight is to gently amplify the hidden states of the delimiter tokens so the model notices them more.

  • How it works: 1) Find the delimiter tokens that mark image boundaries, 2) At selected early layers, multiply their hidden vectors by a factor bigger than 1, 3) Continue the forward pass as usual, 4) The model naturally gives more attention to these stronger dividers (see the one-operation sketch below).
  • Why it matters: Stronger dividers make each image’s “lane” clearer, reducing leakage without hurting within-image reasoning.

🍞 Anchor: It’s like bolding the chapter separators in a book so you never mix up which chapter a paragraph belongs to.
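
In tensor terms, the trick is a single operation on the hidden states. This one-operation sketch uses assumed delimiter positions and an assumed λ = 1.5; the paper’s exact layers and factor may differ:

```python
import torch

hidden = torch.randn(1, 12, 16)          # (batch, seq_len, hidden_dim)
delim_positions = torch.tensor([5, 11])  # e.g., one <img_end> per image (assumed)
lam = 1.5                                # illustrative scaling factor λ > 1

hidden[:, delim_positions, :] *= lam     # boost only the delimiter rows
```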

🍞 Hook: Picture a coach who gives each team a unique wristband color to keep teammates together during drills.

🥬 The Concept (Image-wise Tagging): Each delimiter acts like a local tag that nudges tokens inside the same image to stick together.

  • How it works: 1) Tokens inside Image i pay extra attention to its delimiter i, 2) That shared pull acts like a local bias they all share, 3) This strengthens within-image interactions.
  • Why it matters: Without this tagging, even strong dividers could weaken internal teamwork and hurt reasoning inside each image.

🍞 Anchor: It’s like giving all players on Team Red the same beacon so they naturally coordinate with each other, not with Team Blue.

One-sentence “Aha!”: Make the image separators louder so the model sees crisp borders between images while still letting each image think clearly within itself.

Three analogies:

  1. Book tabs: Thicker, brighter tabs between chapters mean you don’t accidentally read lines from the next chapter.
  2. Traffic lanes: Repainting faded lane lines keeps cars in the right lane without adding new traffic lights.
  3. Locker labels: Bigger, clearer locker numbers stop kids from opening the wrong locker by mistake.

Before vs After: Before scaling, attention maps show fuzzy blocks with spillover across images; answers often mix content (“camels and polar bears” reversed). After scaling, attention clusters cleanly inside each image block, and the model correctly ties details to the right picture.

Why it works (intuition, no equations):

  • Attention is a competition for focus. When delimiters are weak, other-image tokens win some focus they shouldn’t.
  • Scaling the delimiter’s hidden state makes it stand out in both the matching step (who to look at) and the message step (what information they pass along).
  • Because this boost is local (each delimiter mostly pulls its own image), within-image teamwork stays strong while cross-image distractions shrink.
  • Early-layer boosts propagate forward, shaping later attention to respect boundaries (the toy demo below shows how scaling one key reallocates softmax focus).
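
Here is a tiny numeric demo of that competition, independent of any particular model: scaling one key vector raises its softmax weight, pulling focus away from the other tokens.

```python
import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 0.0])       # one query
keys = torch.tensor([[1.0, 0.0],   # row 0: the delimiter's key
                     [0.8, 0.2],   # row 1: an other-image token
                     [0.7, 0.1]])  # row 2: another distractor

for scale in (1.0, 2.0):
    k = keys.clone()
    k[0] *= scale                       # amplify only the delimiter's key
    weights = F.softmax(q @ k.T, dim=-1)
    print(f"scale={scale}: {weights}")  # the delimiter's share grows with scale
```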

Building blocks:

  • What to scale: The true image delimiter tokens (model-specific), not just any special token.
  • Where to scale: Preferably early layers, so the effect flows through the network.
  • How much to scale: A gentle factor (>1) works; results are robust across a reasonable range.
  • Efficiency: No architecture changes, no extra passes, keeps fast attention kernels intact.
  • Generality: Works for multi-image tasks and also for multi-document and multi-table inputs that need clear separation (the config sketch below collects these knobs).
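
A small config sketch that collects these knobs; every default below is an illustrative assumption, not a value from the paper:

```python
from dataclasses import dataclass

@dataclass
class DelimiterScalingConfig:
    scale: float = 1.5                  # λ > 1; a gentle, assumed default
    layers: tuple = (0, 1, 2, 3)        # early layers (assumed indices)
    delimiter_token: str = "<img_end>"  # model-specific; placeholder name

print(DelimiterScalingConfig())
```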

03 Methodology

High-level recipe: Input (multiple images + text) → Insert image delimiter tokens → Encode into tokens → Transformer layers with attention → At selected early layers, multiply delimiter hidden states by a scaling factor → Continue forward pass → Output answer.

Step 1: Prepare the input with true delimiters

  • What happens: For each image, the model inserts special start/end markers that say “this is one image block.” For text-only multi-item tasks (documents/tables), we pick their natural separators (e.g., ||||| in MultiNews) as delimiters.
  • Why this step exists: Without real separators, the model can’t easily tell which tokens belong together, and mix-ups grow.
  • Example: Two images A and B and a question: “Which image has a red bike?” We format as [<imgA_start> imageA tokens <imgA_end>] [<imgB_start> imageB tokens <imgB_end>] [question tokens]. A sketch for locating delimiter positions follows.
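
A sketch of locating delimiter positions in a token-id sequence; the ids are toy values, and for text separators like ||||| you would first check how your tokenizer splits them into sub-tokens:

```python
import torch

def delimiter_mask(input_ids: torch.Tensor, delim_id: int) -> torch.Tensor:
    """Boolean mask over the sequence marking delimiter token positions."""
    return input_ids == delim_id

ids = torch.tensor([5, 9, 9, 7, 3, 3, 7, 2])  # toy ids; pretend 7 = <img_end>
print(delimiter_mask(ids, 7))                 # True at positions 3 and 6
```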

Step 2: Tokenize and encode

  • What happens: Images become visual tokens (like patches or grid features), and text becomes word tokens. These form one long sequence with delimiters in between.
  • Why this step exists: The transformer reads one big list of tokens; everybody needs to be in the same line to be compared.
  • Example: Image A might become 576 visual tokens wrapped between <imgA_start> and <imgA_end>, followed by image B’s tokens between <imgB_start> and <imgB_end>, and then the words of the question (the quick arithmetic below shows where a number like 576 comes from).
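
For ViT-style encoders, a count like 576 is plain patch arithmetic. The 336-pixel image and 14-pixel patch below match common CLIP-style encoders but are assumptions about any particular model:

```python
image_size, patch_size = 336, 14  # typical CLIP ViT-L/14 settings (assumed)
tokens_per_image = (image_size // patch_size) ** 2
print(tokens_per_image)           # 24 * 24 = 576 visual tokens
```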

Step 3: Start the transformer pass (unmodified)

  • What happens: In each layer, queries, keys, and values (Q, K, V) get computed from hidden states, and attention decides which tokens talk to which.
  • Why this step exists: Attention makes the model pick the most helpful pieces at each step.
  • Example: When thinking about “red,” tokens that describe red areas (in the relevant image) should get higher attention weights.

Step 4: The secret sauce—scale delimiter hidden states early

  • What happens: In chosen early layers, we multiply only the delimiter tokens’ hidden vectors by a factor a little bigger than 1. Think of this as turning up their brightness.
  • Why this step exists: Brighter delimiters get more attention and pass along stronger messages, which (a) strengthen within-image interactions and (b) reduce attention paid to other-image tokens.
  • Example: If the delimiter for Image B was a bit ignored before, now it stands out, so tokens inside Image B look to it more, and look less at Image A. A plug-in hook sketch follows.
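
One way to wire this in is a PyTorch forward pre-hook on the chosen early layers, as in this minimal sketch. The attribute paths, token name, layer count, and scale are assumptions, not the authors’ released code:

```python
import torch

def make_scaling_pre_hook(delim_mask: torch.Tensor, scale: float = 1.5):
    """Boost delimiter hidden states just before a decoder layer runs."""
    def pre_hook(module, args):
        hidden = args[0].clone()           # (batch, seq_len, hidden_dim)
        hidden[:, delim_mask, :] *= scale  # scale only delimiter positions
        return (hidden,) + args[1:]
    return pre_hook

# Usage sketch (model/tokenizer attribute names are hypothetical):
# delim_id = tokenizer.convert_tokens_to_ids("<img_end>")
# mask = (input_ids[0] == delim_id)
# for layer in model.language_model.layers[:4]:  # early layers only
#     layer.register_forward_pre_hook(make_scaling_pre_hook(mask))
```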

Step 5: Continue the forward pass

  • What happens: The rest of the model runs exactly the same—no extra steps, no extra passes. The boosted delimiters keep shaping attention in later layers.
  • Why this step exists: We want zero slowdown and zero architecture changes.
  • Example: By the time the model forms its final answer token, the internally built boundaries are clearer.

Step 6: Decode the answer

  • What happens: The model uses the final hidden state to generate text (or choose from options).
  • Why this step exists: That’s the normal output stage; our tweak doesn’t change it.
  • Example: The model outputs: “Image 2 has the red bicycle.”

What breaks without each step?

  • Missing real delimiters: The model can’t form strong triangular attention blocks, and leakage grows.
  • No scaling: Even with delimiters, they’re too quiet; attention still spills across images.
  • Scaling too late: Effects don’t propagate well, so boundaries stay fuzzy.
  • Scaling random tokens: Doesn’t help (and can hurt); only true delimiters carry the right structure.

Walk-through with concrete data:

  • Input: Two images (A: dog on grass; B: cat on sofa) and question “Which image has the cat?”
  • Before scaling: Attention shows some pull from A’s tokens to B’s and vice versa. The model may say “both” or “Image 1.”
  • After scaling: Image B’s delimiter becomes a strong local magnet. Tokens inside B share a common nudge that keeps their focus together; tokens in A get less distracted by B. The answer becomes “Image 2.”

Why this is clever:

  • It boosts both parts of the delimiter’s job at once: (1) attract its own image’s tokens (clear mapping), and (2) act as a shared local bias that keeps teammates together (image-wise tagging).
  • It avoids touching the attention kernel or doing extra passes, so it stays fast and memory-light.
  • It’s plug-and-play across model families by just pointing at the correct delimiter tokens.

Ablations and helpful tips:

  • Scale early layers: Works best because early structure guides later reasoning.
  • Don’t scale the first token or a random special token: Small or negative gains; they lack the image-tagging role.
  • Q vs K vs V scaling: Scaling hidden states naturally boosts all of them together and works best overall; scaling only Keys helped more than only Queries or Values, but still less than scaling hidden states.
  • Hyperparameter λ: Moderate values above 1 consistently improved results; values below 1 hurt by making delimiters even quieter (a toy sweep sketch follows this list).
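
A λ sweep can be a few lines; `evaluate` below is a hypothetical stand-in for running your validation benchmark with delimiter scaling set to a given λ:

```python
def evaluate(lam: float) -> float:
    """Placeholder: swap in a real benchmark run with scaling factor lam."""
    return 1.0 - abs(lam - 1.5)  # pretend accuracy peaks at a moderate λ

candidates = [1.0, 1.2, 1.5, 2.0, 3.0]
best_lambda = max(candidates, key=evaluate)
print(best_lambda)  # 1.5 under this toy stand-in
```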

Edge care for text-image interaction: Text tokens already attend strongly to each other; boosting delimiters reduced text-to-image attention only slightly (~10%) while keeping alignment good in practice. Few-shot interleaved examples (image + Q/A pairs) still improved, showing the method plays well with text.

04 Experiments & Results

The test: Measure how much better models answer questions or make comparisons when given multiple images (or multiple documents/tables) after we boost delimiter tokens. We care about two things: 1) fewer mix-ups across items, and 2) no loss in within-item reasoning.

The competition:

  • Baseline LVLMs using standard delimiters (Qwen2.5-VL, InternVL3, LLaVA-OneVision).
  • Training-free Focus (contrastive decoding) and M-RoPE-style temporal embeddings as alternative strategies.

Scoreboard with context:

  • Multi-image tasks (Mantis, MuirBench, MIRB, QBench2): Across three model families and sizes from tiny (0.5B) to large (32B), scaling delimiters consistently added points. For example, on MuirBench with Qwen2.5-VL-3B, accuracy climbed from about 37.3 to 42.4—like jumping from a C to a solid B in one stroke, without extra training. On Mantis, Qwen2.5-VL-3B rose from ~59.9 to ~63.1; InternVL3-2B on Mantis moved from ~52.1 to ~54.4. These steady bumps showed up model-after-model, benchmark-after-benchmark.
  • Very large models: Even at 72B/78B scale, Mantis scores still nudged up (e.g., Qwen2.5-VL-72B improved from ~74.2 to ~75.6), showing the idea scales.
  • Text-only multi-instance tasks (MultiNews, WCEP-10): ROUGE-1/2/L improved slightly but consistently across models. Think of it as moving from “almost tied” to “noticeably better” in summarizing multiple articles.
  • Multi-table QA (TQABench): Accuracy went up; notably, Qwen2.5-3B + our method beat the 7B baseline in this setting—showing smarter separation can beat brute size.

Surprising or notable findings:

  • No cost bump: Memory and speed basically matched the baseline, while Focus needed much more memory and time (often several times slower and could hit OOM on larger models).
  • Scaling the first token (a common attention sink) helped only a tiny bit; it lacked the image-tag effect, so our method still won by a clear margin.
  • M-RoPE alone underperformed baseline, but combining with true delimiters helped; still, delimiter scaling was stronger and simpler.
  • Cross-image attention visibly dropped by about half in some directions (e.g., from Image 3 to Images 1/2), while within-image attention stayed steady: exactly the behavior we want. A sketch for measuring this leakage follows the list.
  • Few-shot with interleaved examples (image, Q, A, repeated): Accuracy still went up, showing text–image coordination remains healthy.
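
Cross-image leakage can be measured directly from an attention map, assuming you can export per-layer attention weights and know each image’s token range; a minimal sketch:

```python
import torch

def cross_image_leakage(attn: torch.Tensor, blocks):
    """attn: (seq, seq) attention weights; blocks: [(start, end)] per image.
    Returns, per image, the fraction of its queries' attention landing outside it."""
    fractions = []
    for start, end in blocks:
        rows = attn[start:end]             # queries coming from this image
        total = rows.sum()
        within = rows[:, start:end].sum()
        fractions.append(((total - within) / total).item())
    return fractions

toy_attn = torch.softmax(torch.randn(12, 12), dim=-1)
print(cross_image_leakage(toy_attn, [(0, 6), (6, 12)]))  # two toy image blocks
```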

Qualitative examples:

  • Bicycle case: Baseline claimed both images had a man on a bicycle; our method correctly linked the bike to only the second image.
  • Animals case: Baseline swapped which image had camels vs. polar bears; our method kept the mapping straight.

Takeaway: Across families, sizes, and tasks, delimiter scaling worked like repainting lane lines: small effort, broad, dependable gains. It’s not a flashy overhaul; it’s a clean, targeted fix that the numbers—and attention maps—back up.

05 Discussion & Limitations

Limitations (be specific):

  • Needs access to hidden states and the exact delimiter tokens: That’s easy in open-source models but not possible from outside most proprietary APIs.
  • Requires true, consistent delimiters: If the input format lacks clean markers (e.g., some video streams without frame separators), you can’t directly apply it.
  • Over-scaling or late-layer-only scaling can unbalance attention; good defaults help, but extreme settings can harm results.
  • It is designed to reduce leakage; on rare tasks that demand heavy cross-image fusion at every step, too-strong separation might not help.

Required resources:

  • Standard GPUs that already run your LVLM. Because we don’t add extra passes or change attention kernels, memory and speed are essentially the same as baseline.
  • A small hyperparameter search for which early layers to scale and by how much.

When NOT to use:

  • Single-image tasks: There’s no multi-item confusion to fix.
  • Inputs without clear separators: If you can’t identify true delimiters, the method doesn’t know what to scale.
  • Tasks where deliberate cross-image blending is the core skill at every step (uncommon); use moderate λ to avoid over-separation.

Open questions:

  • Can we learn the scaling factor automatically per input without breaking efficiency (e.g., an efficient proxy for entropy-based adaptation)?
  • What are the best universal default layers across many architectures, to minimize tuning?
  • Can we generalize to videos with learned temporal separators or lightweight frame tags?
  • How does this interact with retrieval-augmented inputs that interleave many small contexts?
  • Are there principled ways to combine delimiter scaling with modest training for even bigger gains?

06 Conclusion & Future Work

Three-sentence summary: This paper shows that multi-image confusion in LVLMs comes from weak separation between images, even when delimiter tokens are present. By simply scaling the hidden states of those delimiters in early layers, we make image boundaries crisp, reducing cross-image leakage while preserving within-image reasoning. The method works across models and tasks, boosts accuracy, and adds virtually no compute or memory overhead.

Main achievement: A tiny, training-free tweak—amplifying delimiter tokens—that consistently strengthens multi-image understanding and even helps multi-document/table tasks, without slowing inference or modifying attention kernels.

Future directions: Automate layer and factor selection, explore lightweight adaptive scaling (e.g., entropy-informed but efficient), extend to video by learning temporal separators, and combine with small amounts of targeted training to push gains further. Also, build friendly APIs so proprietary systems can adopt it internally.

Why remember this: It’s a clean engineering insight—make the existing separators do their job better. Like repainting lane lines, it’s simple, cheap, and unexpectedly powerful, turning messy traffic into smooth, well-separated flow.

Practical Applications

  • Improve comparison questions across multiple product photos in shopping assistants without retraining the model.
  • Boost accuracy when summarizing clusters of news articles by marking and strengthening document separators.
  • Answer multi-table business questions more reliably by scaling the table delimiter tokens.
  • Make classroom tools better at handling worksheets with multiple diagrams or pages in one prompt.
  • Support medical research prototypes that compare side-by-side scans (using open-source models and proper governance).
  • Help legal or finance analysts keep multiple contracts or reports separate during long-context Q&A.
  • Enhance few-shot prompts that interleave several examples with images and answers before the real question.
  • Stabilize photo forensics tasks that compare two similar images without mixing their details.
  • Enable on-device or memory-limited deployments to handle multi-image tasks without extra compute cost.
  • Augment retrieval-augmented systems that pack many snippets by scaling the snippet delimiters for cleaner reasoning.
#Large Vision-Language Models · #Multi-image understanding · #Delimiter tokens · #Attention · #Cross-image leakage · #Hidden state scaling · #Image-wise tagging · #Training-free method · #FlashAttention compatibility · #Multi-document summarization · #Multi-table QA · #Qwen2.5-VL · #InternVL3 · #LLaVA-OneVision