OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Intermediate
Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang et al. · 1/14/2026
arXiv · PDF

Key Summary

  • OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.
  • It uses SAM2 to get 2D masks from many photos, lifts them into 3D, and merges them into stable, view-consistent 3D object groups.
  • Then it builds a scene map: each object gets a 3D position and a short, standardized caption written by a multimodal language model.
  • When you ask a question like “the yellow dog toy with skinny legs,” OpenVoxel matches your words directly to the captions; no CLIP/BERT training is needed.
  • On hard referring expression segmentation (RES), it scores 42.4% mIoU, like jumping from a B- to a solid A compared to many prior methods.
  • On open-vocabulary segmentation (OVS), it also performs competitively or better, showing it handles both simple and complex queries.
  • It runs fast: about 3 minutes per scene versus 40–120+ minutes for methods that must train language features.
  • Captions are canonicalized (same template every time), which makes matching sturdy even when wording changes.
  • Limitations include sensitivity to segmentation parameters and difficulty selecting small parts of a larger object.
  • Because it’s training-free and text-to-text, it’s easy to adapt, explain, and extend to new scenes without costly fine-tuning.

Why This Research Matters

OpenVoxel makes 3D scene understanding fast, flexible, and easy to deploy because it avoids training language features per scene. That means a robot or AR assistant can scan a new space, label objects clearly, and immediately answer natural questions. It handles everyday language, not just short labels, which matches how people actually talk. Since the system explains itself through readable captions, it’s more transparent and easier to debug. Faster setup (minutes, not hours) reduces cost and makes it practical for homes, warehouses, classrooms, and hospitals. As better multimodal models arrive, OpenVoxel can upgrade instantly without changing its simple, powerful recipe.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you and your friends built a LEGO city and took photos from many angles. Now someone asks, “Where’s the tiny yellow car next to the blue house?” You’d scan the city and point to it. Teaching a computer to do that in 3D used to be slow and fussy.

Before this work, 3D scene understanding often relied on powerful but training-hungry methods like NeRF or 3D Gaussian Splatting. These methods could render beautiful new viewpoints, but when you wanted the computer to answer language questions—like “find the cup on the table”—researchers had to train special language features inside the 3D model. That meant gathering data, running long training jobs, and aligning text and visuals in a fixed embedding space (like CLIP/BERT). It worked okay for short labels (like “cup” or “chair”), but it struggled with long, detailed sentences (like “the camera with the round lens sitting behind the plant”).

The problem: people don’t always use the same simple labels. We describe color, shape, material, and relations (“left of,” “behind,” “near”), and we can be creative. Old systems learned a single language space and hoped all queries would land close to the right object there. In practice, they often missed the mark on complex, human-style descriptions and needed costly, scene-specific training with human-annotated masks and sentences.

Failed attempts included distilling 2D language features into 3D (e.g., CLIP features baked into Gaussians), contrastive training to separate objects, and even training on per-scene referring expressions. These helped but at the price of time, annotation effort, and brittleness. When the sentence changed slightly—or mentioned subtle attributes—performance could wobble. Plus, to deploy on new scenes, you’d often need to repeat training or at least heavy feature extraction.

The gap was clear: What if we could skip training 3D language features entirely, yet still understand rich, free-form language? What if we could let strong multimodal language models reason in plain text instead of squeezing everything into a fixed vector space? And could we do this while keeping the 3D geometry consistent across views, so “the toy dog” means the same thing from every angle?

Enter a simple but powerful idea: build an object-level 3D map first (group voxels into real, view-consistent instances), then caption each object in clear, standardized language, and finally answer questions by matching text to text. No gradient descent. No embedding-space gymnastics. Just a good 3D grouping plus good language.

Real stakes: This matters anywhere a robot, AR app, or digital assistant must understand a space without long training sessions. Think tidying robots that hear “pick up the small red cup near the sink,” AR games that highlight “the old book with the green spine on the middle shelf,” or warehouse drones told “scan the sealed white box under the metal rack.” If systems can skip retraining and still reason about new phrases, setup gets faster, cheaper, and easier to explain.

02 Core Idea

🍞 Hook: You know how librarians don’t memorize every book’s secret code? They place books on clear shelves with helpful labels. Then when you ask for “the blue mystery novel about a lighthouse,” they just read labels and pick the right one.

🥬 The Concept: The key insight is to avoid training a 3D language feature at all. Instead, group the 3D scene into objects, write a clear caption for each object, and answer your question by matching your words directly to those captions.

  • How it works (high level):
    1. Reconstruct a 3D scene as sparse voxels from many photos (fast rendering, easy to handle).
    2. Use SAM2 masks from each view to group voxels into objects, keeping them consistent across views.
    3. For each object, render views and ask a multimodal LLM to produce a canonical (standard-form) caption.
    4. When a user asks a question, rewrite it into the same template and do text-to-text matching against the scene map.
  • Why it matters: Without this, you must train language features per scene or rely on fixed embeddings that may miss nuanced, human-style descriptions. Here, we leverage language models’ native strength—reading and writing text—so the system handles richer queries without extra training.

🍞 Anchor: Imagine your room scanned in 3D. Each object—“teddy bear, brown fur, sitting, on chair”—has a label card. You say, “the soft brown bear on the chair,” and the system points to the same labeled object. No retraining; just read the card.

Multiple Analogies:

  1. Grocery store: Instead of memorizing barcodes (embeddings), we put readable labels on shelves (“ripe bananas, yellow, in fruit section”). Shoppers’ requests match the labels directly.
  2. Map app: We don’t train on your voice; we mark places with descriptions (“playground, slides, near river”), then match your spoken request to the labels.
  3. School cubbies: Each cubby has a name tag and notes. Finding “the red backpack under the art poster” is matching notes, not retraining your eyes.

Before vs After:

  • Before: Train 3D features to match a fixed text space; decent for short tags but clunky for long sentences; slow to set up; needs annotations.
  • After: Training-free grouping + captions; flexible language; fast per scene; explainable by reading captions.

Why It Works (intuition):

  • Grouping: Objects stay the same across views, so we need a stable instance grouping. Using per-view masks and 3D voting creates a consensus center for each object.
  • Canonical captions: Language is messy. Standardizing into a fixed template makes matching robust to wording changes.
  • Text-to-text retrieval: Let language models do what they do best—compare meanings in words, not in rigid vector spaces.

Building Blocks (introduced with Sandwich cards below):

  • Sparse Voxel Rasterization (SVR)
  • Vision-Language Models (VLMs)
  • Multimodal Large Language Models (MLLMs)
  • Group Field Construction
  • Canonical Scene Map Construction
  • Referring Query Inference
  • Referring Expression Segmentation (RES)
  • Open-Vocabulary Segmentation (OVS)
  • OpenVoxel (the full pipeline)

Concept Cards (Sandwich Pattern):

  1. SVR (Sparse Voxel Rasterization) 🍞 You know how Minecraft worlds are made of tiny cubes? SVR is like that for real scenes. 🥬 What: A fast way to represent and render 3D scenes using a sparse set of tiny cubes (voxels).
  • How: From many photos, build a 3D grid that only keeps cubes where there’s something real; render by stacking each cube’s color and see-through level along camera rays.
  • Why: Without SVR, updates and rendering would be slow or heavy, making grouping and captioning impractical. 🍞 Example: A desk scene becomes a light-weight set of cubes forming the mug, book, and lamp, ready for fast rendering and analysis.
  2. VLMs (Vision-Language Models) 🍞 Imagine a friend who can describe a picture and also read the caption under it. 🥬 What: Models that understand images and text together.
  • How: They learn from many image–text pairs to connect what they see with what words mean.
  • Why: Without VLMs, the system can’t produce grounded descriptions tied to visuals. 🍞 Example: Given a masked photo of a mug, a VLM helps say, “mug, white ceramic, handle, on table.”
  3. MLLMs (Multimodal Large Language Models) 🍞 Think of a smart helper that can read, look, and reason at once. 🥬 What: Big language models that also take images (and sometimes video) as input.
  • How: They blend text understanding with visual clues to write, refine, and compare descriptions.
  • Why: Without MLLMs, we couldn’t reliably standardize captions or match complex queries. 🍞 Example: The model rewrites “a green round object, maybe apple” into “apple, light green, smooth surface, on table.”
  4. Group Field Construction 🍞 Picture sorting puzzle pieces from many photos into the same bag if they show the same object. 🥬 What: A training-free way to assign voxels to object instances using per-view masks and 3D voting.
  • How: Lift 2D masks into 3D, compute per-object centers, let voxels vote for the nearest center across views, and merge overlapping masks.
  • Why: Without consistent groups, captions would mix parts of different objects and queries would fail. 🍞 Example: All voxels of the toy dog cluster around one center, even from different camera angles.
  5. Canonical Scene Map Construction 🍞 Imagine writing neat label cards for each object so anyone can find them. 🥬 What: Build a map listing each object’s 3D position and a standardized caption template.
  • How: Render object views, caption with DAM/VLM, then have an MLLM rewrite into a fixed format: category, appearance, function/part-of, placement.
  • Why: Without canonical captions, wording differences make matching unreliable. 🍞 Example: “toy dog, yellow, smooth plastic, on table” stored with its 3D center.
  6. Referring Query Inference 🍞 Think of a librarian matching your request to the exact book label. 🥬 What: Rewrite the user’s query into the same template and match it to scene-map captions.
  • How: Canonicalize the query (short phrase), compare text-to-text with the scene map, then render the matched object’s mask.
  • Why: Without this step, natural language requests wouldn’t connect neatly to the 3D objects. 🍞 Example: “funny toy with spindly legs” → “toy, yellow, slim legs,” which matches the toy dog’s caption.
  7. RES (Referring Expression Segmentation) 🍞 Like playing I-Spy in 3D: “I spy a cup near the bottle.” 🥬 What: Segment the part of the scene that a sentence refers to.
  • How: Interpret attributes and relations, select the right object group, and render its mask in the target view.
  • Why: Without accurate RES, assistants can’t follow natural, detailed requests. 🍞 Example: “the sake cup near the bottle” highlights only the small cup, not the whole table.
  8. OVS (Open-Vocabulary Segmentation) 🍞 Like recognizing a new animal at the zoo because you understand clues, not just names you’ve memorized. 🥬 What: Segment objects even if they weren’t seen during training.
  • How: Use captions and flexible language matching rather than fixed, trained tags.
  • Why: Without OVS, systems only handle a closed list of labels. 🍞 Example: “ukulele” is segmented correctly even if it wasn’t a training class, because the caption says “small wooden instrument with strings.”
  9. OpenVoxel 🍞 Imagine a tool that turns a messy room scan into neatly labeled objects without any extra study time. 🥬 What: A training-free pipeline to group voxels, caption objects, and answer language queries by text-to-text retrieval.
  • How: SVR for 3D, SAM2 masks lifted into 3D for grouping, MLLMs for canonical captions and matching, and fast rendering of results.
  • Why: Without OpenVoxel’s training-free design, you’d need long training and still struggle with complex sentences. 🍞 Example: In about 3 minutes, a new scene becomes a searchable 3D catalog you can talk to.

03 Methodology

🍞 Hook: Think of building a museum guide in three steps: first, place each artifact in the right room (grouping). Next, write a clear label for each artifact (captioning). Finally, answer visitors’ questions by reading the labels (retrieval).

At a high level: Multi-view images → SVR 3D voxels → Training-Free Sparse Voxel Grouping → Canonical Scene Map Construction → Referring Query Inference → Output masks + matched captions.

Step 0: Input and SVR backbone

  • We start with many photos and camera poses. Using Sparse Voxel Rasterization (SVR), we reconstruct a sparse voxel field that renders fast and keeps per-voxel attributes like color and opacity. This lets us project between 2D views and 3D voxels efficiently, which is vital for our grouping recipe.
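To make the rendering weights concrete, here is a minimal Python sketch of standard alpha compositing along one camera ray. This is the textbook volume-rendering formulation assumed by NeRF-style and SVR renderers, not code from the paper; the variable names are illustrative.

```python
import numpy as np

# Minimal sketch: alpha-compositing weights along one camera ray.
# Each voxel i hit by the ray has an opacity a_i; its rendering weight is
# w_i = T_i * a_i, where T_i is the transmittance (how much light survives
# the voxels in front of it).  These per-voxel weights are what the grouping
# step reuses to decide how strongly a voxel supports a given pixel/mask.
def ray_weights(alphas):
    alphas = np.asarray(alphas, dtype=float)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    return transmittance * alphas

alphas = np.array([0.1, 0.6, 0.8])                 # three voxels along one ray
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
w = ray_weights(alphas)
pixel_color = (w[:, None] * colors).sum(axis=0)    # blended color seen at the pixel
print(w, pixel_color)
```

In OpenVoxel these same weights also indicate how much each voxel contributed to a pixel under a SAM2 mask, which is what drives the voting in Step 1.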

Step 1: Training-Free Sparse Voxel Grouping (Group Field Construction) 🍞 You know how you can tell puzzle pieces belong together because they fit around the same center picture? We do that in 3D. 🥬 What: Assign each voxel a 3D “vote” pointing to the center of the object it belongs to; progressively update these votes using per-view instance masks.

  • How (recipe-like):
    1. Per-view masks: Run SAM2 on each image to get instance masks. For each masked instance, compute its 3D centroid by averaging the ray-hit 3D points under that mask.
    2. Vote accumulation: For each voxel, add a weighted contribution toward the mask’s centroid based on how much that voxel contributes to each pixel (volume-rendering weights). Keep a confidence weight per voxel.
    3. Start condition: Initialize votes and weights using the first processed view’s masks; start a group dictionary of instance IDs and their 3D centers.
    4. Match and merge across views: Render current group assignments back into the next view, match them with that view’s SAM2 masks by IoU, and relabel matched instances. For noisy over-segmentation (like a fingertip split from a hand), re-prompt SAM2 to merge overlapping small masks.
    5. Finalize: After all views, each voxel picks the nearest existing group center according to its normalized vote, producing stable, view-consistent 3D instances.
  • Why: Without voting and cross-view matching, objects would fragment differently per view, wrecking later captioning and query matching. 🍞 Example: The “toy dog” voxels all converge on one 3D center even when seen from front, left, or above; small stray fragments get merged back.
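Here is a toy, runnable sketch of the centroid-voting idea described above. It illustrates the mechanism under simplified assumptions (hand-made mask centroids and rendering weights); it is not the paper's actual implementation.

```python
import numpy as np

# Toy sketch of training-free grouping by 3D centroid voting:
# each per-view SAM2 mask contributes its 3D centroid, every voxel accumulates
# a weighted vote toward the centroids of the masks it renders into, and each
# voxel finally joins the group whose center is closest to its normalized vote.
n_voxels = 4
group_centers = {1: np.array([0.05, 0.0, 0.0]),   # e.g. the "toy dog"
                 2: np.array([2.05, 0.0, 0.0])}   # e.g. the "apple"

# (mask centroid in 3D, per-voxel rendering weight under that mask) for two views.
# In the real pipeline these come from SAM2 masks and volume-rendering weights.
observations = [
    (group_centers[1], np.array([0.9, 0.8, 0.0, 0.0])),
    (group_centers[2], np.array([0.0, 0.0, 0.9, 0.7])),
]

votes = np.zeros((n_voxels, 3))
weights = np.zeros(n_voxels)
for centroid, w in observations:
    votes += w[:, None] * centroid     # weighted vote toward this mask's centroid
    weights += w                       # confidence accumulated per voxel

normalized = votes / np.maximum(weights, 1e-8)[:, None]
assigned = [min(group_centers, key=lambda gid: np.linalg.norm(v - group_centers[gid]))
            for v in normalized]
print(assigned)   # -> [1, 1, 2, 2]: voxels converge on the right object centers
```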

Step 2: Canonical Scene Map Construction 🍞 Museums have neat labels so any guide can help any visitor. We do the same for 3D objects. 🥬 What: For each 3D instance, produce a short, standardized caption and record its 3D center—together forming a scene map S.

  • How (recipe-like):
    1. Render masked views: For each group, render its binary mask over multiple views and darken everything outside the mask. Add a tiny red dot to focus the model (visual prompt).
    2. Detailed draft caption: Feed these masked frames to a captioning model (e.g., DAM) to get a rich description. It may say “a green round object, possibly an apple…”
    3. Canonicalize with MLLM: Ask a multimodal LLM (e.g., Qwen3-VL-8B) to rewrite the draft into a strict template: <category>, <appearance>, <function/part-of>, <placement>. This removes vagueness (“object”) and normalizes phrasing.
    4. Store entry: For each group, store {id, center, canonical caption} in S.
  • Why: Without canonical captions, different wording makes retrieval brittle. Standard forms make matching stable and interpretable. 🍞 Example: The green fruit becomes “apple, light green, smooth surface, on table,” with its 3D center recorded.
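A minimal sketch of what one scene-map entry could look like in code. The template fields follow the paper's description (category, appearance, function/part-of, placement), but the prompt wording, the `ask_mllm` callable, and the stubbed response are assumptions made so the snippet runs on its own.

```python
# Sketch of canonical scene-map construction (prompt wording and ask_mllm are assumed).
CANONICAL_TEMPLATE = "<category>, <appearance>, <function/part-of>, <placement>"
CANONICAL_PROMPT = (
    "Rewrite the following object description into the exact template "
    f"'{CANONICAL_TEMPLATE}'. Avoid vague words like 'object'.\n\nDescription: {{draft}}"
)

def canonicalize(draft_caption, ask_mllm):
    """ask_mllm: any callable that sends a text prompt to a multimodal LLM."""
    return ask_mllm(CANONICAL_PROMPT.format(draft=draft_caption))

# Stubbed MLLM so the sketch runs end to end; a real system would call an MLLM here.
fake_mllm = lambda prompt: "apple, light green, smooth surface, on table"

scene_map = [{
    "id": 1,
    "center": (0.42, 0.10, 0.75),   # 3D group center from the grouping step
    "caption": canonicalize("a green round object, possibly an apple", fake_mllm),
}]
print(scene_map[0]["caption"])
```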

Step 3: Referring Query Inference 🍞 When someone asks, “Where’s the yellow toy with thin legs?”, we turn their words into a neat label and match it to the right object. 🥬 What: Turn the query into the same canonical template and do text-to-text retrieval over the scene map.

  • How (recipe-like):
    1. Query refinement: Use the MLLM to rewrite the user’s request (with or without a view image) into a short canonical phrase: “toy, yellow, slim legs.”
    2. Caption-first matching: Compare this phrase to every scene-map caption. Prefer exact/synonym matches on category, then align attributes. If multiple groups describe parts of one object, return all relevant IDs.
    3. Optional world-geometry checks: If the query mentions view-independent relations like “closest” or “between,” use stored 3D centers to apply those constraints.
    4. Render result: With the selected group IDs, rasterize only those voxels to produce the final 2D mask for the requested view and report the exact matched caption(s).
  • Why: Without canonicalize-and-match, variable human phrasing would cause misses; using text-to-text lets the language model reason flexibly. 🍞 Example: “A drinking utensil near the sake bottle” selects the small sake cup’s group and renders a crisp mask in the target view.
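The sketch below approximates the text-to-text retrieval step. In the actual pipeline an MLLM compares the canonicalized query with the scene-map captions; here a simple token-overlap score stands in for that comparison so the example runs without any model.

```python
# Simplified stand-in for caption-first matching: score each scene-map caption
# against the canonicalized query by token overlap (the real system lets an MLLM
# do this comparison in natural language).
def token_overlap(query_phrase, caption):
    q = set(query_phrase.lower().replace(",", "").split())
    c = set(caption.lower().replace(",", "").split())
    return len(q & c) / max(len(q), 1)

scene_map = [
    {"id": 1, "caption": "apple, light green, smooth surface, on table"},
    {"id": 2, "caption": "toy dog, yellow, slim legs, on table"},
    {"id": 3, "caption": "camera, black and white, round lens, on counter"},
]

# MLLM rewrite of "A funny toy with spindly legs" into the canonical template.
canonical_query = "toy, yellow, slim legs"
best = max(scene_map, key=lambda entry: token_overlap(canonical_query, entry["caption"]))
print(best["id"], best["caption"])   # -> 2: the toy dog; its voxels are then rasterized
```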

Secret Sauce (what’s clever):

  • Training-free grouping via 3D voting: We replace high-dimensional learned features with a simple, robust 3D centroid voting scheme that aggregates across views using rendering weights.
  • Canonical language bridge: By forcing both object descriptions and queries into the same, short template, we turn a fuzzy language problem into a tidy label-matching problem.
  • Text-to-text over fixed embeddings: We sidestep the limits of a single embedding manifold and let MLLMs compare meanings directly in natural language, improving flexibility for complex queries.

Concrete Data Walkthrough:

  • Suppose a scene has 3 objects: {apple, toy dog, camera}.
    1. Grouping: Across 120 views, SAM2 produces instance masks. Voxels vote for centers; fingertip fragments get merged; final IDs: {1: apple, 2: toy dog, 3: camera}.
    2. Captioning: DAM drafts: “green round object…,” “yellow toy with legs…,” “black and white device with lens…”. MLLM canonicalizes:
      • ID1: “apple, light green, smooth surface, on table”
      • ID2: “toy dog, yellow, slim legs, on table”
      • ID3: “camera, black and white, round lens, on counter”
    3. Query: “A funny toy with spindly legs.” → “toy, yellow, slim legs.” Match to ID2; render its 2D mask.
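When a query adds a view-independent relation such as “closest to the apple” or “between,” the stored 3D centers resolve it after caption matching has narrowed the candidates. A minimal sketch with made-up coordinates:

```python
import numpy as np

# Hypothetical example of the optional world-geometry check: after caption matching,
# stored 3D group centers resolve view-independent relations like "closest to the apple".
centers = {
    "apple":   np.array([0.40, 0.10, 0.75]),
    "toy dog": np.array([0.55, 0.12, 0.80]),
    "camera":  np.array([2.10, 0.30, 0.60]),
}

anchor = centers["apple"]
candidates = ["toy dog", "camera"]                       # caption-matched candidates
closest = min(candidates, key=lambda name: np.linalg.norm(centers[name] - anchor))
print(closest)                                           # -> "toy dog"
```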

Runtime Tips (engineering pragmatics described in the paper):

  • Process up to ~150 views; perform SAM2-based merging every few steps to save time; sample 8 mask–frame pairs per group for captioning to keep MLLM input small. The whole pipeline per scene takes ~3 minutes and answers queries in under a second.

04 Experiments & Results

🍞 Hook: Think of a school competition where teams must find the right object in a 3D scene based on tricky riddles. Some teams trained for weeks learning code words; one team simply labeled the objects clearly and read the labels during the game.

The Tests (what and why):

  • RES (Referring Expression Segmentation): Given a natural-language sentence (“the cup near the bottle”), produce an accurate mask of that object in the image. This stresses nuanced language and relations.
  • OVS (Open-Vocabulary Segmentation): Given a simple category name (“cup,” “camera”), produce the mask. This checks generalization to labels beyond a fixed training list.
  • Datasets: LeRF subsets captured by iPhone/Polycam: Ref-LeRF (RES with 4 scenes), LeRF-OVS (OVS with 3 scenes), and LeRF-Mask (simpler OVS with fewer, less ambiguous queries). These are standard in recent 3D language–aware work, so comparisons are fair and meaningful.
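Both tasks are scored with mean intersection-over-union (mIoU) between predicted and ground-truth masks. Here is the textbook definition as a short snippet; it is the standard metric, not code released with the paper.

```python
import numpy as np

# Standard IoU for binary masks; mIoU averages this over all queries/scenes.
def iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

pred = np.array([[1, 1, 0], [0, 1, 0]])   # predicted mask
gt   = np.array([[1, 1, 0], [0, 0, 0]])   # ground-truth mask
print(iou(pred, gt))                       # 2 shared pixels / 3 total -> 0.667
miou = float(np.mean([iou(pred, gt)]))     # average IoU across all test queries
```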

The Competition (baselines):

  • Methods that embed language into 3D features: LangSplat, OpenGaussian variants, GOI, GS-Grouping.
  • A strong RES method: ReferSplat, which trains with human-annotated sentence–mask pairs.
  • Some variants and re-implementations are reported, to ensure coverage across scenes.

Scoreboard with Context:

  • RES (Ref-LeRF, mIoU):

    • OpenVoxel: 42.4%
    • ReferSplat (paper): 29.2% (OpenVoxel is +13.2 points—like jumping from a B- to an A)
    • ReferSplat (reproduced): 24.5% (OpenVoxel is +17.9 points—more than a full letter grade)
    • Others (e.g., GOI, GS-Grouping): notably lower on these complex sentences.
    Meaning: On the toughest, most natural language tests, OpenVoxel’s caption-first, text-to-text approach shines. It’s especially strong when sentences include attributes (“slim legs”), affordances (“used for cutting paper”), or relations (“near the apple”).
  • OVS (LeRF-OVS, mIoU):

    • OpenVoxel: 66.2%
    • 3DVLGS: 64.3%
    • CCL-LGS: 65.1%
    • LangSplat: 53.7%
    Meaning: Even for simpler category queries, OpenVoxel stays competitive or better. Canonical captions plus direct text matching help consistently pick the right object.
  • OVS (LeRF-Mask, mIoU/mBIoU):

    • OpenVoxel: mIoU 87.2, mBIoU 81.4
    • Strong baselines also surpass 70% mIoU here. Because queries are fewer and clearer, many methods do well; OpenVoxel remains among the top performers.

Qualitative Highlights (what you can see):

  • Figurines scene: For “A minimalist style toy with natural grooves next to a red apple,” some methods latch onto “near the apple” and pick the wrong patch. OpenVoxel retrieves the correct toy pumpkin with a smooth, complete mask.
  • Ramen scene: For “A drinking utensil near a sake bottle,” some methods mask both glass and bottle; OpenVoxel isolates the small sake cup correctly.

Surprising Findings:

  • Training-free can be stronger on natural language: Learning sentence embeddings for 3D sometimes overfits to seen phrasings, causing unstable performance on new wordings. OpenVoxel’s text-to-text retrieval avoids this and proves robust.
  • Speed: OpenVoxel completes grouping + captioning in ~3 minutes per scene on a single RTX 5090 GPU, versus ~40 min to >1 hr for training-reliant methods. That’s at least 10× faster setup for new scenes, with sub-second query time.

Ablations (what parts really matter):

  • Mask merging helps: removing the SAM2-based merging of small overlapping masks drops RES mIoU from 28.0 to 24.3, showing that the merge step cuts segmentation noise.
  • Canonical captions matter a lot: Adding canonicalization boosts from 28.0 to 36.4; standard phrasing dramatically stabilizes matching.
  • Canonical queries finish the job: Final step to 42.4% mIoU by aligning query format to object captions.
  • Model choices: SAM2 > SAM for grouping (SAM tends to over-fragment); DAM > other captioners (better consistency for masked regions); larger Qwen3-VL models improve canonicalization and retrieval.

Takeaway: OpenVoxel hits a sweet spot—clear instance grouping, standardized captions, and direct text-to-text matching—delivering both accuracy and speed on real, language-heavy 3D tasks.

05 Discussion & Limitations

🍞 Hook: If you label every drawer in your kitchen, you can find things fast—but if the labels are sloppy or the drawers are oddly split, you’ll still fumble sometimes.

Limitations:

  • Parameter sensitivity: Because grouping assembles objects from per-view masks, settings like how often to merge or how many views to sample matter. Over-fragmentation can create many tiny groups; under-merging can split one object into parts.
  • Parts vs wholes: The pipeline groups at the instance level. If a query asks for a small part (e.g., “camera flash”), OpenVoxel may return the whole camera unless the part is segmented as its own instance during grouping.
  • MLLM dependence: Canonicalization and retrieval rely on an MLLM. Turning off chain-of-thought speeds things up but limits complex reasoning; smaller MLLMs can underperform on template rewriting or matching.

Required Resources:

  • A pre-trained SVR scene per environment (from multi-view images and camera poses).
  • A 2D segmentation foundation model (SAM2) for instance masks.
  • A captioning model (e.g., DAM) and an MLLM (e.g., Qwen3-VL-8B) for canonicalization and retrieval.
  • A single modern GPU makes the pipeline fast (~3 minutes per scene), but CPUs/older GPUs may take longer.

When NOT to Use:

  • Ultra-fine part queries without fine-grained masks available (e.g., “the tiny screw head on the handle”).
  • Scenes with extremely poor segmentation quality where SAM2 cannot produce meaningful masks.
  • Settings needing heavy view-dependent relations (“left of” relative to the current camera) if the system is restricted to view-independent world relations.

Open Questions:

  • Richer scene graphs: Could we store object-to-object relations (on top of, between, part-of) explicitly to answer more complex spatial queries robustly?
  • Better part handling: Can we create hierarchical groups (object ↔ parts) so the system can pick either the whole or just the requested component?
  • MLLM advances: As faster, stronger MLLMs appear, how much can reasoning and matching improve without sacrificing the under-1-second query speed?
  • Adaptive prompting: Could tailored prompts per model size (2B/4B/8B) squeeze more accuracy without slowing inference?

Bottom line: OpenVoxel is practical and fast, but results depend on solid 2D masks and smart canonical language. It’s a strong base to build richer 3D language understanding with scene graphs and part hierarchies.

06 Conclusion & Future Work

Three-Sentence Summary:

  • OpenVoxel shows you don’t need to train 3D language features to understand scenes: group voxels into consistent objects, canonically caption them, and match queries to captions.
  • This training-free, text-to-text approach handles both simple labels (OVS) and nuanced sentences (RES) and runs about 10× faster to set up new scenes.
  • On tough RES benchmarks, it achieves large gains over prior art, while staying competitive on OVS.

Main Achievement:

  • A clean, effective pipeline—Training-Free Sparse Voxel Grouping + Canonical Scene Map + Referring Query Inference—that replaces heavy embedding training with readable captions and direct language reasoning.

Future Directions:

  • Add an explicit scene graph (relations like on, in, between; part-of links) to boost spatial and part-level queries.
  • Explore hierarchical grouping for part–whole selection.
  • Leverage newer, faster MLLMs and improved prompts for better canonicalization and retrieval.

Why Remember This:

  • OpenVoxel flips the script: instead of bending language into 3D features, it writes labels that people (and models) can read. That simplicity brings speed, accuracy on natural language, and easy deployment to new scenes—exactly what real-world 3D assistants and robots need.

Practical Applications

  • Home robots: “Pick up the small blue cup near the sink” without retraining for each kitchen.
  • AR shopping: Tap “the red shoe with white stripes on the second shelf” for instant highlights.
  • Warehouse inventory: “Find the sealed white box under the metal rack” with quick 3D scans.
  • Museum guides: “Show me the bronze statue with a smooth finish by the east window.”
  • Construction/inspection: “Highlight the cracked panel next to the yellow beam.”
  • Classroom demos: Students explore a 3D lab and ask for “the glass beaker on the counter.”
  • Security monitoring: “Locate the unattended black bag near the entrance.”
  • Retail analytics: “Count the green bottles on the middle shelf across all aisles.”
  • Film/game set design: “Select all wooden props on the table with visible text or logos.”
  • Accessible interfaces: Voice-driven object finding for people with low vision in 3D spaces.
#OpenVoxel#Sparse Voxel Rasterization#training-free 3D understanding#open-vocabulary segmentation#referring expression segmentation#multimodal large language model#vision-language model#canonical captions#scene map#SAM2#text-to-text retrieval#3D grouping#mIoU#NeRF#3D Gaussian Splatting