ObjEmbed: Towards Universal Multimodal Object Embeddings
Key Summary
- ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
- It gives every detected object two tiny summaries (embeddings): one for meaning (what it is) and one for box quality (how well it is localized).
- The final object score is the product of semantic similarity and a predicted IoU score, so the AI prefers objects that both match the words and are tightly boxed.
- All objects and the full image are encoded together in one pass through a single multimodal language model, making it efficient.
- ObjEmbed works for object detection, referring expression comprehension, local image retrieval, and global image retrieval in one unified framework.
- On COCO detection it reaches 53.0% mAP, on RefCOCO/+/g it averages 89.5% accuracy, and on local image retrieval it outperforms prior models by about 20 points.
- A special sequence design and five custom tokens (object, iou, global, localtext, globaltext) make the model versatile without changing backbones between tasks.
- Training combines region-level contrastive learning, image-level contrastive learning, and IoU regression on a curated 1.3M-sample dataset.
- Performance still depends on the quality of region proposals, but when ground-truth boxes are mixed in, scores jump significantly.
- ObjEmbed shows a balanced, general-purpose way to represent objects that is both semantically sharp and spatially aware.
Why This Research Matters
ObjEmbed helps computers find the exact thing we're talking about, even when it's small or hard to see, and to trust that the bounding box is tight. That's vital for safety tasks like reading distant traffic signs or robots picking tiny parts. It also makes search engines more useful by finding images that match very specific phrases, like a logo on a shirt or a license plate number. Because it keeps strong global understanding too, it fits many apps without switching models. The design is efficient, so it can process many objects at once without slowing way down. Overall, it brings meaning and precision together in one practical system.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're sorting a big box of LEGO pieces while reading a short instruction card. It's easy to match the whole scene (a castle!) with the instructions, but much harder to find the tiny red door piece hidden in the pile when the card says "attach the small red door."
🥬 Filling (The Actual Concept): Before ObjEmbed, most models were great at matching a whole image with a sentence (global alignment), but not so great at finding and matching the small parts inside that image with precise phrases (fine-grained alignment). They could say, "This picture matches the caption," but struggled with, "Which exact object in this picture matches this phrase?"
- What it is: ObjEmbed is a system that represents each object in an image with two compact summaries (embeddings): one for meaning and one for localization quality, plus a global image summary.
- How it works: It proposes candidate regions, turns each into two special tokens (one for semantics, one for IoU/box quality), sends them with the global image to a multimodal language model, and learns to rank objects by multiplying semantic match with predicted IoU.
- Why it matters: Without it, the model might pick an object that sounds right but is poorly boxed, or a well-boxed object that doesn't match the words, leading to mistakes in safety-critical tasks.
🍞 Bottom Bread (Anchor): Think of searching photos for "the tiny stop sign far away." A global-only system might miss it, but ObjEmbed finds the small sign and checks its box tightness so the right image ranks first.
Now let's introduce the key ideas in the right order so they feel familiar and clear.
- 🍞 You know how a comic book has pictures and speech bubbles you read together? 🥬 Multimodal Embedding Models:
- What it is: These are models that turn different kinds of data (like images and text) into comparable numbers so they can be matched.
- How it works: They convert an image into a vector, convert a sentence into another vector, and bring close matches nearer and push mismatches apart.
- Why it matters: Without a shared space, the model can't tell which words match which images. 🍞 Anchor: When you ask for "a dog playing frisbee," the model finds images whose vectors sit closest to that sentence vector.
- 🍞 You know how you sort your backpack by items (books, pencil case, lunch) to stay organized? 🥬 Object-Oriented Representation:
- What it is: A way to represent each object in an image separately, not just the whole image.
- How it works: The system detects potential objects and gives each its own embedding, keeping track of both what it is and where it is.
- Why it matters: Without object focus, small or specific items get lost inside the whole-scene summary. 🍞 Anchor: If the text says "the yellow lollipop," the model can check the lollipop's own embedding instead of the entire picture's embedding.
- 🍞 Imagine tracing two shapes on transparent sheets and sliding them together to see overlap. 🥬 IoU (Intersection over Union):
- What it is: A score that tells how well a predicted box overlaps the correct box.
- How it works: It measures the area where boxes overlap divided by the area covered by either box.
- Why it matters: Without a measure like IoU, the model can't tell if it pointed to the object precisely or sloppily. 🍞 Anchor: A perfectly aligned box around a stop sign has high IoU; a box that only half-covers it has lower IoU (a small worked computation appears right after this list).
- 🍞 You know how a magnifying glass helps you focus on one tiny spot of a picture? 🥬 Region of Interest (RoI):
- What it is: A selected part of an image the model looks at closely.
- How it works: A proposal generator suggests candidate boxes likely to contain objects; features from those regions are extracted.
- Why it matters: Without RoIs, the model wastes effort on empty areas and misses small objects. 🍞 Anchor: For "number 8 jersey," the RoI zeroes in on a player's chest area where the number appears.
- 🍞 Think of playing "spot the difference" between two images. 🥬 Contrastive Learning:
- What it is: A training style where matching pairs are pulled closer and non-matching pairs are pushed apart.
- How it works: The model learns by comparing, rewarding correct pairings and discouraging wrong ones.
- Why it matters: Without contrasts, the model can't sharpen what makes items different. 🍞 Anchor: The phrase "red umbrella" should end up close to images with red umbrellas and far from blue umbrellas.
- 🍞 Imagine a friend pointing and saying, "That cup on the left of the plate." 🥬 Visual Grounding:
- What it is: Linking words to the exact spot of the matching object in the image.
- How it works: The AI compares text to object embeddings and picks the correctly located one.
- Why it matters: Without grounding, the AI can't be sure which instance of an object the text means. 🍞 Anchor: For "the cat on the right sofa cushion," the model selects the right-side cat's box, not the left.
- 🍞 Suppose you hear about a fruit you've never seen, like "dragon fruit," and learn to spot it from a description. 🥬 Open-Vocabulary Object Detection:
- What it is: Detecting objects described by words, even if their class wasn't seen in training.
- How it works: The detector aligns region features with text embeddings, so descriptions guide recognition.
- Why it matters: Without open vocabularies, systems only recognize a fixed list of labels. 🍞 Anchor: Given "a transformer cabinet with yellow lightning signs," the model can detect it despite not having a fixed class for it.
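To make the IoU score concrete, here is a minimal, self-contained Python sketch of the standard computation for two axis-aligned boxes; the function name, the (x1, y1, x2, y2) box format, and the example coordinates are ours for illustration, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (empty if the boxes don't touch).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A perfectly aligned box scores 1.0; one that only half-covers the sign scores far lower.
print(iou((10, 10, 50, 50), (10, 10, 50, 50)))  # 1.0
print(iou((10, 10, 50, 50), (30, 10, 70, 50)))  # ~0.33
```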
What was missing? Past models did not explicitly judge how good each bounding box was when matching text to objects. So they could say "this looks like a lollipop," but not "this is the best boxed lollipop." ObjEmbed fills that gap by learning a dedicated IoU embedding per object and multiplying it with the semantic match. That makes the model prefer objects that are both the right thing and well-framed.
Why should anyone care? In real life, details matter. Self-driving cars must read small traffic signs; robots must grasp tiny parts; search engines must find the correct product logo on a busy shelf. ObjEmbed brings the precision and the meaning together, like having both sharp eyes and a smart brain, so the right tiny thing is found, named, and trusted.
02 Core Idea
🍞 Top Bread (Hook): You know how a great coach looks for both the best player and the player who's in the best position to score right now? Picking who to pass to is about skill and position.
🥬 Filling (The Actual Concept): The big idea of ObjEmbed in one sentence: Separate an object's meaning from its box quality using two tokens, then multiply those scores so the best-matched and best-localized object wins, while still keeping a global image embedding for classic retrieval.
Multiple analogies (3 ways):
- Librarian analogy: The "object token" is like a book's title card (what it's about). The "IoU token" is like a sticky note saying "shelved perfectly" or "misplaced." You pick books that match the topic and are correctly shelved.
- Treasure map analogy: The description says what treasure looks like (semantic), while a confidence meter tells how accurate the X on the map is (IoU). You choose the treasure spot that both matches the clue and has a confident X.
- Photo search analogy: You search "the player wearing number 8." The AI finds all player candidates (semantics) and ranks higher the one whose box tightly covers the number (IoU), so you get the right player with a tight box, not a sloppy guess.
Before vs. After:
- Before: Models did strong whole-image matching and sometimes regional matching, but without judging how good the box was. This could return the right object class but a sloppy box, or miss tiny instances.
- After: ObjEmbed explicitly learns IoU quality alongside semantics and multiplies them for scoring. It also keeps global embeddings so it works for both local and global retrieval. Result: better small-object recall, sharper localization, and more trustworthy matches.
Why it works (intuition):
- Decoupling reduces conflicts. One token focuses on "what" (semantics), the other on "how well located" (IoU). If both were jammed into one token, training signals could clash.
- The product rule enforces agreement. If either meaning or localization is weak, the final score stays modest; only objects that are both right and well-boxed score high (a minimal scoring sketch follows this list).
- Single-tower shared space. Text and visual tokens share the same LLM encoder, so they speak the same "language," improving cross-modal matching.
- All-at-once encoding. Putting all objects (and the global image) in a single pass gives consistent context and efficiency without autoregressive delays.
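Here is a minimal sketch of that product rule, assuming we already have a text embedding, one semantic embedding per candidate object, and a predicted IoU in [0, 1] for each candidate; the array names and numbers are illustrative, not the paper's code.

```python
import numpy as np

def final_object_scores(text_emb, object_embs, pred_ious):
    """Score = cosine similarity with the query (meaning) x predicted IoU (box quality)."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    object_embs = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    semantic = object_embs @ text_emb      # how well each object matches the words
    return semantic * pred_ious            # a weak match OR a loose box keeps the score low

# Two "stop sign" candidates with similar meaning but different box tightness.
text = np.array([0.9, 0.1, 0.0])
objects = np.array([[0.88, 0.12, 0.05],    # tightly boxed sign
                    [0.85, 0.20, 0.10]])   # sloppily boxed sign
print(final_object_scores(text, objects, np.array([0.95, 0.55])))  # the tight box wins
```

Because the two factors multiply, a candidate can only score high when both meaning and localization are strong, which is exactly the agreement described above.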
Building blocks (with sandwich mini-explanations where new):
- 🍞 You know how you circle likely items in a picture before looking closely? 🥬 Proposal Generator:
- What it is: A tool that suggests candidate boxes likely to contain objects.
- How it works: It scans the image, proposes top-N regions, and passes them forward.
- Why it matters: Without good proposals, the right object might never be considered. 🍞 Anchor: For a street scene, it proposes boxes around signs, cars, and people.
- 🍞 Imagine turning a whole paragraph into a short, meaningful summary. 🥬 Embedding:
- What it is: A compact vector that captures the essence of something (object or text).
- How it works: The model encodes inputs into numbers where similar things are close.
- Why it matters: Without embeddings, the AI can't compare pictures and words fairly. 🍞 Anchor: "red umbrella" and an umbrella region share nearby vectors.
- 🍞 Think of two teammates: one identifies who has the ball; the other checks if they're in bounds. 🥬 IoU Embedding (new twist on IoU):
- What it is: A learned token whose hidden state predicts the box quality (IoU) for each object.
- How it works: For every object token, an IoU token follows; a small head predicts IoU.
- Why it matters: Without it, the system can't prefer tightly boxed matches over sloppy ones. 🍞 Anchor: Two similar "stop sign" boxes exist; the tighter one gets a higher IoU score.
- 🍞 You know how you might want both a postcard view and a zoomed detail? 🥬 Global vs. Local Text Tokens:
- What it is: Separate tokens for matching whole images (globaltext) and matching objects (localtext).
- How it works: Prompts guide whether we're doing global image retrieval or object-level matching.
- Why it matters: Without separation, the model might confuse tasks with different goals. 🍞 Anchor: "Find an image of a beach at sunset" (global) vs. "Find the blue surfboard" (local).
- 🍞 Like a recipe that needs the right balance of flavors. 🥬 Contrastive Objectives (region-level and image-level) + IoU loss:
- What it is: Three learning signals: match text to regions, match text to images, and learn IoU quality.
- How it works: Sigmoid focal losses handle many-to-one region matches and scalable image negatives; a focal loss also trains IoU prediction on positives.
- Why it matters: Without all three, you miss either semantics, global alignment, or localization quality. 🍞 Anchor: The model learns to rank "yellow lollipop" near the right small region and to score that region's box quality well (a generic sketch of the focal loss appears at the end of this section).
Put together, these parts make a single, efficient, object-savvy embedding model that can do detection, grounding, local retrieval, and global retrieval without switching architectures.
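Since the contrastive objectives above rely on a sigmoid focal loss, here is a generic Python/PyTorch sketch of that loss; the alpha and gamma defaults and the toy logits are illustrative assumptions, not ObjEmbed's actual hyperparameters or training code.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss over match logits.

    logits:  raw text-object (or text-image) similarity scores before the sigmoid
    targets: 1.0 for positive pairs (e.g., proposals overlapping the described object), else 0.0
    """
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)   # probability assigned to the true label
    loss = ce * (1 - p_t) ** gamma                      # down-weight easy, confident pairs
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()

# One text query against four proposals; only the first two are positives.
logits = torch.tensor([3.1, 2.4, -1.0, -2.5])
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(sigmoid_focal_loss(logits, targets))
```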
03 Methodology
High-level recipe: Input image → Proposal generator + RoIAlign → Build token sequence (object + iou pairs, global tokens, task instructions) → Single-pass encoding with one LLM → Compute similarities and scores → Train with region contrastive + image contrastive + IoU regression.
Step-by-step details:
- Propose likely objects (WeDetect-Uni) and extract RoI features.
- What happens: For each image, produce top-N (e.g., 100) candidate boxes and use RoIAlign to pull features from each region.
- Why it exists: If you don't shortlist regions, you'll either miss small items or waste compute on empty areas.
- Example: In a soccer photo, proposals include players, jersey numbers, ball, and goal frame.
- Compress each RoI into an object token via an object projector.
- What happens: A small network turns each RoI feature into a single token embedding that replaces an ⟨object⟩ placeholder.
- Why it exists: The LLM expects token sequences, so each region must become a compact token.
- Example: The jersey-number region becomes a single object token carrying fine-grained details.
- For each object token, add a following IoU token.
- What happens: Build a structured sequence like: "Object i: ⟨object⟩⟨iou⟩." The IoU token learns to predict box quality (a schematic of this layout appears after the step list).
- Why it exists: Without decoupling, meaning and box quality signals fight inside one token, hurting both.
- Example: Two overlapping boxes around the same stop sign will get different IoU predictions, helping rank the tighter one higher.
- Insert two global image tokens for coarse and detailed global embeddings.
- What happens: Place ⟨global⟩ tokens (two copies) into the sequence: one supervised by short captions, the other by long captions.
- Why it exists: Short and long captions train complementary global views; two tokens prevent mixing signals.
- Example: Short: "A gray fighter jet on a runway." Long: adds positions, counts, and relationships.
- Add task instructions and object separators.
- What happens: Prompts like "Object i:" make each object distinct; task-specific instructions (for detection vs. referring) steer the features.
- Why it exists: Without instructions, the model blurs tasks and objects, reducing accuracy.
- Example: For referring expressions, instructions emphasize unique instance details and relations ("to the left of …").
- Encode everything in one pass through a single multimodal LLM (Qwen3-VL-Instruct backbone).
- What happens: Vision tokens (the full image), object+IoU tokens, and global tokens flow together in one forward pass; no autoregressive decoding is needed.
- Why it exists: Single-pass encoding is efficient and consistent; all objects "see" the same context.
- Example: With 100 objects (8 tokens each) and ~1000 image tokens, the total stays under ~2000 tokens; FlashAttention-2 speeds it up.
- Encode text queries with the same LLM using special text tokens.
- What happens: For object-level queries, use: "Find an object that matches the given caption. CAPTION ⟨localtext⟩." For image-level queries, use the global version with ⟨globaltext⟩.
- Why it exists: Sharing the backbone ensures text and visual embeddings live in the same space.
- Example: "the player wearing the number 8 jersey" produces a localtext embedding that will match jersey-region object tokens.
- Compute scores.
- What happens: For each object and a local text embedding, compute cosine similarity (classification score). Predict IoU from the IoU embedding. Multiply them for the final object score.
- Why it exists: Multiplication rewards objects that are both semantically right and well localized.
- Example: Two similar "yellow lollipop" regions get similar semantic scores, but the tighter-box one gets higher IoU, thus higher final score.
- Aggregate for retrieval.
- Local image retrieval (text-to-image): Score each image by taking the maximum object score across its objects. Rank images by this max.
- Global image retrieval: Use the global image embeddings and global text embeddings directly.
- Why it exists: Max-over-objects lets a small target drive the whole image ranking; global retains classic tasks.
- Example: Query: "a tiny distant traffic sign." Even if it's 3% of the image, the best-matching sign region sets the image score.
- Train with three objectives.
- Region-level contrastive learning: For each object description, proposals with IoU>0.5 are positives; others are negatives. Train with sigmoid focal loss to handle many-to-one and class imbalance.
- Image-level contrastive learning: Also sigmoid focal; supervise one global token with short captions, the other with long captions; share negatives across devices.
- IoU regression: Predict IoU for positive proposals using focal loss on the IoU score.
- Why it exists: Together, they teach semantics at both region and image level, plus localization quality.
- Example: The phrase "the red and white locomotive with a number on the front" pulls the matching region close; the IoU head learns how tight its box is.
- Data construction and instructions.
- Data: 1.3M images, 8.1M boxes from detection/REC datasets plus SA-1B and licensed web images; boxes from WeDetect-Uni; object-level unique descriptions and image captions from a large annotator model.
- Instructions: Different prompts for detection vs referring guide the model to learn class-typical features vs instance-unique details.
- Why it exists: Diverse, well-structured supervision improves generalization and reduces false negatives.
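To tie the steps above together, here is a schematic Python sketch of the sequence layout, the object-level query template, and the max-over-objects ranking rule. The placeholder spellings (<image>, <object>, <iou>, <global>, <localtext>) and helper names are illustrative stand-ins, and the detection instruction string is an assumption rather than the paper's exact prompt; only the query template wording comes from the step above.

```python
def build_visual_sequence(num_objects, task_instruction):
    """One single-pass input: full image, task instruction, an <object><iou> pair per proposal,
    and two <global> tokens (one later supervised by short captions, one by long captions)."""
    parts = ["<image>", task_instruction]
    parts += [f"Object {i}: <object><iou>" for i in range(num_objects)]
    parts += ["<global>", "<global>"]
    return " ".join(parts)

def build_local_text_query(caption):
    """Object-level query template described in the method steps."""
    return f"Find an object that matches the given caption. {caption} <localtext>"

def rank_images_for_query(per_image_object_scores):
    """Local image retrieval: each image is scored by its single best object, then sorted."""
    image_scores = {img: max(scores) for img, scores in per_image_object_scores.items()}
    return sorted(image_scores, key=image_scores.get, reverse=True)

print(build_visual_sequence(2, "Detect the objects in the image."))  # instruction text is hypothetical
print(build_local_text_query("the player wearing the number 8 jersey"))
print(rank_images_for_query({"img_a": [0.12, 0.31], "img_b": [0.05, 0.87, 0.10]}))  # img_b ranks first
```

The max-over-objects rule is what lets a tiny but well-matched, well-boxed region pull its whole image to the top of the ranking.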
Secret sauce (what's clever):
- The decoupled two-token design per object (object + IoU) solves the tug-of-war between "what" and "how well boxed."
- The product scoring elegantly enforces joint correctness.
- Two global tokens let short and long captions teach complementary global views.
- One-pass, single-tower encoding keeps it fast and unified across tasks.
04 Experiments & Results
The test: The authors checked if ObjEmbed can (a) detect and classify objects well, (b) understand referring phrases to pick the right instance, (c) retrieve images from small-object text queries (local retrieval), and (d) still do classic global image-text retrieval.
The competition: They compared against strong open-vocabulary detectors (GLIP, OWLv2, Grounding-DINO, LLMDet, WeDetect) and modern multimodal embedding models (FG-CLIP2, Qwen3-VL-Embedding, GME, VLM2Vec, UME-R1, etc.). They also noted that specialist detectors tend to excel at localization on known classes, while MLLMs excel at language understanding but often struggle with precise boxes.
The scoreboard (with context):
- Object detection (COCO mAP): ObjEmbed-4B gets 53.0% mAP. That's like an A when many general-purpose MLLMs are getting C's on localization-heavy tests; it's competitive with strong open-vocab detectors while keeping broad language skills.
- Referring expression comprehension (RefCOCO/+/g accuracy@0.5 IoU): ObjEmbed-4B averages 89.5. Think of it like answering nearly 9 out of 10 pointing questions correctly in crowded scenes, on par with or better than larger MLLMs specialized for referral.
- Local image retrieval (SORCE-1K, REIRCOCO, ILIAS): ObjEmbed-4B averages 68.5, beating global embedding baselines by roughly 20 points. That's a big jump, like going from a B- to a solid A, and especially impressive because local retrieval needs small-object sensitivity and fine-grained matching.
- Global image retrieval (ShareGPT4V, DCI, COCO, Flickr30K, multilingual COCO-CN, Flickr30K-CN): Despite training on a relatively small 1.3M set, ObjEmbed-4B averages 81.7 Recall@1 across diverse benchmarks. Thatâs competitive with top embedding models designed primarily for global tasks.
Surprising findings:
- Local retrieval transfer: Even when image-to-image local retrieval wasn't directly optimized, the global embeddings still worked well, suggesting the shared space learned by ObjEmbed generalizes across modalities.
- Two global tokens help: Supervising one global token with short captions and another with long captions lifted global retrieval, showing complementary global viewpoints matter.
- Decoupling wins: Turning the "one token that does everything" into two (object + IoU) improved detection mAP meaningfully. Predicting IoU inside classification labels helped, and decoupling it helped even more.
- Proposal recall caps performance: When ground-truth boxes are mixed into the proposal set, performance jumps (e.g., +12.2 AP on COCO), showing that ObjEmbed's representation is strong but the recall of the proposal generator limits the ceiling.
- Box regression inside the embedding model hurt: Trying to regress box coordinates from the IoU token at the same time degraded performance, likely due to training conflicts, reinforcing the "do one job well per token" insight.
What each result means in plain terms:
- Detection: ObjEmbed sees objects with both brain (meaning) and ruler (box quality). That balance lets it hang with detectors while understanding complex language.
- Referring: Given "the car's license plate in HAWAII," it finds the exact plate instance reliably.
- Local retrieval: Queries about small parts (like a logo or a number) finally work well because the model looks over all objects and picks the best-matched, best-localized one.
- Global retrieval: Keeping a strong global embedding means ObjEmbed doesn't give up the classic tasks; it's a generalist that also specializes in the small stuff.
Takeaway: ObjEmbed pushes embedding models from whole-scene generalists toward object-smart specialistsâwithout sacrificing versatilityâby explicitly learning and using localization quality.
05 Discussion & Limitations
Limitations:
- Proposal dependence: If the proposal generator misses an object, ObjEmbed can't embed it. This caps performance, especially for very tiny or extremely crowded scenes when only top-N proposals are used.
- Data scale: Trained on 1.3M samples, much smaller than some CLIP-style datasets, so there is untapped potential in scaling data.
- Hard negatives: While focal losses help, smarter hard-negative mining (without injecting false negatives) could sharpen discrimination further.
- Box regression conflicts: Adding box-offset regression into the same framework degraded results, suggesting that coordinate prediction may conflict with the clean embedding objective.
Required resources:
- Compute: Training used 16 GPUs with vision encoder frozen; single-pass encoding is efficient but upstream proposal generation and large LLM backbones still require capable hardware.
- Data: Region-level supervision (detection and referring) plus carefully curated captions; auto-annotation with strong MLLMs helps but benefits from quality control.
When not to use:
- End-to-end detection with box refinement only: If your main goal is precise coordinate regression and you don't need embeddings or retrieval, a specialized detector trained end-to-end for box regression might be better.
- Scenes with extreme object counts beyond proposal limits: If thousands of tiny instances matter and proposals are too few, consider increasing proposals or using dedicated dense detectors.
Open questions:
- Can proposal generation be integrated or co-trained to raise recall without losing efficiency?
- How far can performance scale with more data, harder negatives, and multilingual fine-grained supervision?
- Can we extend object embeddings to instance masks or part-level embeddings for even finer grounding?
- Is there a conflict-free way to add gentle box refinement without hurting embedding quality, perhaps via a separate auxiliary head or staged training?
06 Conclusion & Future Work
Three-sentence summary: ObjEmbed is an object-centric embedding model that splits each object into two learned tokens, one for meaning and one for box quality, and multiplies their scores to favor objects that are both semantically correct and well-localized. It encodes all objects and the full image in a single pass through a shared multimodal LLM, supporting object detection, referring comprehension, local retrieval, and global retrieval. Across 18 benchmarks, it shows balanced, strong performance, especially shining in local retrieval where fine-grained object matching matters most.
Main achievement: The key contribution is the decoupled two-token per-object design (object + IoU) with product scoring, which neatly resolves the conflict between "what it is" and "how well it's boxed," bringing reliable, fine-grained grounding into an efficient, unified embedding model.
Future directions:
- Scale up data and hard-negative mining to further improve discrimination.
- Raise proposal recall via better proposal generators or joint training.
- Explore mask-level or part-level embeddings for even finer grounding.
- Investigate conflict-free, light box refinement strategies that preserve embedding quality.
Why remember this: ObjEmbed shows a simple, powerful principle: separate "meaning" from "localization quality," then make them agree. That one idea unlocks accurate small-object retrieval and strong cross-modal grounding while keeping classic global retrieval performance, making it a practical, general-purpose foundation for visual understanding.
Practical Applications
- Driver assistance: Retrieve and prioritize camera frames containing small, critical signs (e.g., speed limits) with tight localization.
- Robotics: Find and grasp tiny parts on cluttered workbenches by matching text instructions to object embeddings.
- E-commerce search: Retrieve product photos matching small logos, patterns, or tags described in text queries.
- Document visual search: Locate and rank images containing specific labels or numbers (e.g., jersey numbers, license plates).
- Content moderation: Flag and localize sensitive symbols or items described in policy text within user images.
- Photo organization: Let users search personal albums for fine-grained items like "the red kite in the background."
- Video keyframe retrieval: Find frames with specific small objects (e.g., "the yellow screwdriver on the left shelf").
- Augmented reality: Ground user voice commands to exact objects ("highlight the blue resistor under the green wire").
- Industrial inspection: Retrieve examples of tiny defects or markings from large image repositories with precise localization awareness.
- Education tools: Enable students to query images for particular parts ("the mitochondria in the cell image") and see exact highlights.