
Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Intermediate
Yoonwoo Jeong, Cheng Sun, Frank Wang et al. · 12/24/2025
arXiv · PDF

Key Summary

  • This paper speeds up how 3D scenes handle big, 512‑dimensional features without throwing away important information.
  • Instead of mixing every single 3D blob (Gaussian) along a camera ray, it only mixes the few that actually matter most.
  • The trick, called Quantile Rendering (Q-Render), picks Gaussians by watching how much transparency (transmittance) drops along the ray.
  • This cuts the math from doing huge work per pixel (O(N·C)) to a leaner recipe (O(N + K·C)), where K is small and C is feature size.
  • The method plugs into a 3D neural network (GS-Net) that predicts high‑dimensional features for Gaussians and learns from 2D CLIP features.
  • On ScanNet and LeRF-OVS, it achieves higher accuracy than past methods while being fast enough for real‑time use.
  • With 512‑D features, Q-Render reaches about 43.7× speedup in rendering compared to a common baseline.
  • It stays accurate because it approximates the original renderer using evenly spaced ‘checkpoints’ in transparency, with error shrinking as 1/K.
  • It uses less memory than other 512‑D methods since it avoids big per-view caches.
  • Limits include choosing a good K, depending on input 3D Gaussians’ quality, and sensitivity to the backbone network and voxel grid size.

Why This Research Matters

Open-vocabulary 3D understanding lets you ask for any object in a scene using plain language and get instant answers. This is vital for AR glasses that highlight items you ask for, or home robots that must find tools safely and quickly. Creative tools can tag and edit parts of 3D scenes by name (like 'make the couch blue') without manual labeling. Because the method keeps full 512‑D richness and runs in real time, it bridges the gap between powerful language models and high-speed 3D graphics. Lower memory and higher speed also make it more practical on limited hardware. In short, it helps machines both see and understand 3D worlds fast, accurately, and flexibly.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how in a crowded room, lots of people are talking, but only a few voices actually reach your ears clearly? If you try to listen to everyone at once, you get overwhelmed, and you still miss the important parts.

🥬 Filling (The Actual Concept):

  • What it is: This paper is about making 3D computers (that draw scenes with many fuzzy blobs called Gaussians) learn and render rich, 512‑dimensional features quickly, without losing the special details that connect pictures with words.
  • How it works (story of the field):
    1. The world before: 3D Gaussian Splatting (3D‑GS) made 3D scenes fast and pretty for RGB images. People then wanted more than colors—they wanted meaning: “where is the chair?” “find the mug.” This needs high‑dimensional, language‑aligned features like CLIP’s 512‑D vectors.
    2. The problem: The usual rendering method (volume rendering) mixes all Gaussians that a camera ray touches. That is okay for 3 color channels but becomes very slow for 512 features per pixel and per Gaussian.
    3. Failed attempts: Many tried squishing 512‑D features into tiny codebooks (like 3‑D or 6‑D). That runs fast but throws away information, so open‑vocabulary segmentation gets worse.
    4. The gap: We needed a way to keep rich 512‑D signals yet avoid mixing every Gaussian on the ray.
    5. The solution’s idea: Maybe not all Gaussians matter. If we only blend the ones that actually change the ray’s transparency the most, we can keep quality and skip the rest.
  • Why it matters: Without this, 3D understanding either runs too slowly (full 512‑D blending) or loses meaning (compression). That hurts apps like AR search (“highlight the tea cup”), robot navigation, and editing 3D scenes with natural language.

🍞 Bottom Bread (Anchor): Imagine sorting your backpack by only checking spots where you know you put big items (like the lunchbox corners), instead of digging through every pocket. You find what matters faster and still grab the important stuff.

Now, let’s unpack the key building blocks in the order you need them.

🍞 3D Gaussian Splatting (3D‑GS)

  • Hook: Imagine building a 3D world from puffs of colored fog. Each puff is soft, a bit see‑through, and has a place and shape.
  • The Concept: 3D‑GS represents a scene as many soft 3D blobs (Gaussians) with position, size, rotation, color, and opacity. A renderer projects and blends them to make images. How it works:
    1. Store lots of Gaussians with centers, sizes, rotations, colors, and opacities.
    2. For each camera pixel, find which Gaussians it intersects.
    3. Blend them in order of depth to get the pixel result. Why it matters: Without Gaussians, we’d struggle to get both speed and quality in real‑time 3D rendering.
  • Anchor: Think of a room made of many tiny colored mist balls. A camera looks through them and blends their colors to get the final picture.
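
To make these ingredients concrete, here is a minimal sketch of the per‑Gaussian attributes described above. The class and field names are illustrative only (real 3D‑GS implementations store these as packed tensors and use spherical‑harmonic color coefficients):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One soft 3D blob; field names are illustrative, not the 3D-GS codebase's."""
    center: np.ndarray    # (3,) position in world space
    scale: np.ndarray     # (3,) size along each local axis
    rotation: np.ndarray  # (4,) quaternion orientation
    color: np.ndarray     # (3,) RGB (full 3D-GS uses spherical-harmonic coefficients)
    opacity: float        # in [0, 1]; how much light this blob blocks

# A scene is simply a long list of such blobs that the renderer projects and blends.
scene = [
    Gaussian3D(center=np.array([0.0, 0.0, 2.0]),
               scale=np.array([0.05, 0.05, 0.05]),
               rotation=np.array([1.0, 0.0, 0.0, 0.0]),
               color=np.array([0.8, 0.2, 0.2]),
               opacity=0.7),
]
```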

🍞 Volume Rendering

  • Hook: You know how stacking several sheets of colored cellophane changes what you see? Each sheet adds a tint and blocks some light.
  • The Concept: Volume rendering blends contributions from all intersecting Gaussians along a ray using their opacity. How it works: 1) Sort Gaussians by depth, 2) compute how much light remains (transmittance), 3) alpha‑blend each one, 4) repeat until done. Why it matters: It’s accurate, but for 512‑D features, blending everything is slow.
  • Anchor: Like looking through layered sunglasses; every layer slightly darkens and colors your view.

🍞 Transmittance Profile

  • Hook: Imagine how much light gets through as you stack more sunglasses.
  • The Concept: Transmittance tells how much light remains after passing each Gaussian; it decreases along the ray. How it works: Start at 1 (all light), multiply by (1 − opacity) for each Gaussian, and it steps down toward 0. Why it matters: It shows which Gaussians truly affect the pixel; big drops mean big influence.
  • Anchor: If one super‑dark sunglass makes the view much dimmer, that layer really matters.

🍞 Alpha‑Blending

  • Hook: Mixing watercolor paints on paper produces a combined color.
  • The Concept: Alpha‑blending mixes each Gaussian’s contribution with the remaining light. How it works: weight = transmittance × opacity; add weight × feature/color; update transmittance. Why it matters: It decides how much each Gaussian counts in the final pixel.
  • Anchor: Dripping paint into clear water: the first drops color lots of water; later drops affect less as the water gets darker.
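
Putting the transmittance profile and alpha‑blending together, a minimal reference sketch of full volume rendering along one ray could look like this. It assumes the Gaussians hit by the ray are already sorted front‑to‑back and that each provides an opacity and a feature (a 3‑D color or a 512‑D embedding); this is an illustration of the standard blending rule, not the paper's CUDA rasterizer:

```python
import numpy as np

def volume_render_ray(alphas, features):
    """Full alpha-blending along one ray (Gaussians given front-to-back).

    alphas:   (N,) opacity of each intersected Gaussian, in [0, 1]
    features: (N, C) feature (or color) carried by each Gaussian
    Returns the blended (C,) result. Cost is O(N * C): every Gaussian
    touches every channel, which is what hurts when C = 512.
    """
    T = 1.0                            # transmittance: how much light remains
    out = np.zeros(features.shape[1])
    for alpha, f in zip(alphas, features):
        weight = T * alpha             # this Gaussian's share of the pixel
        out += weight * f
        T *= (1.0 - alpha)             # light left after passing this blob
        if T < 1e-4:                   # ray is effectively opaque; stop early
            break
    return out
```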

🍞 CLIP Features (512‑D)

  • Hook: You know how you can connect a word like “mug” to any picture of a mug?
  • The Concept: CLIP maps images and text into the same 512‑D space so matching words with regions is easy. How it works: A vision encoder extracts a 512‑D vector for an image patch; a text encoder does the same for a word; close vectors mean a match. Why it matters: This lets a 3D scene answer open‑ended queries like “find the toaster.”
  • Anchor: It’s like a bilingual dictionary for pictures and words: “dog” and a dog photo land in the same place.
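
For readers who want to see where the 512‑D vectors come from in practice, here is a small sketch using the Hugging Face `transformers` CLIP wrapper; the `openai/clip-vit-base-patch32` checkpoint projects both images and text into a shared 512‑D space (the paper's exact CLIP variant and preprocessing may differ, and the file name below is a hypothetical example):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("mug_crop.png")       # e.g., a masked image region (hypothetical file)
inputs = processor(text=["a photo of a mug"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])    # (1, 512)
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"]) # (1, 512)

# Close vectors (high cosine similarity) mean the word matches the image region.
sim = torch.nn.functional.cosine_similarity(img_feat, txt_feat)
print(sim.item())
```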

🍞 Contrastive Loss

  • Hook: If you practice by comparing right vs. wrong answers, you learn faster.
  • The Concept: Contrastive loss pulls matching pairs together and pushes non‑matches apart in feature space. How it works: Compute cosine similarity between rendered features and the correct CLIP feature; reward high match and penalize mismatches. Why it matters: It teaches 3D features to align with text/image meaning.
  • Anchor: Like training by matching flashcards: “apple” goes with the red fruit picture, not the chair.

02Core Idea

🍞 Top Bread (Hook): Imagine you’re scanning a bookshelf for one special book. You don’t open every book. You just check the few shelf spots where the spine colors suddenly change—that’s where important switches happen.

🥬 Filling (The Actual Concept):

  • What it is: Quantile Rendering (Q‑Render) is a way to render high‑dimensional features by only blending the Gaussians that cause the biggest drops in transparency along each ray.
  • How it works (recipe):
    1. March along the ray and track transmittance (how much light remains).
    2. Split the 0→1 transmittance range into K+1 equal slices (quantiles).
    3. Whenever the running transmittance crosses a slice boundary, pick that Gaussian as a “quantile Gaussian.”
    4. Alpha‑blend only these K Gaussians’ features.
    5. Normalize the result to account for the skipped tail so the total contribution matches the original scale.
    6. Repeat per pixel to get the 2D feature map.
  • Why it matters: This keeps the most influential Gaussians and skips the rest, massively cutting computation while closely matching full volume rendering. The approximation error shrinks like 1/K.

🍞 Bottom Bread (Anchor): Like sampling the most important beats in a song instead of listening to every second, you still get the melody while saving tons of time.
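
Here is a minimal single‑ray sketch of the recipe above in plain NumPy. It follows the steps as stated (track transmittance, pick a Gaussian whenever a quantile boundary is crossed, blend only those picks, then normalize by the absorbed light); the paper's actual CUDA kernel may differ in implementation details, including the exact normalization:

```python
import numpy as np

def quantile_render_ray(alphas, features, K=10):
    """Q-Render sketch along one ray.

    alphas:   (N,) opacities of the Gaussians the ray hits, front-to-back
    features: (N, C) per-Gaussian feature vectors (e.g., C = 512)
    K:        number of quantile checkpoints (picked Gaussians)
    """
    T = 1.0                               # running transmittance
    step = 1.0 / (K + 1)                  # K+1 equal slices of the [0, 1] range
    next_boundary = 1.0 - step            # first checkpoint below full transmittance
    out = np.zeros(features.shape[1])

    for alpha, f in zip(alphas, features):
        T_new = T * (1.0 - alpha)
        if T_new <= next_boundary:        # transmittance crossed a quantile boundary:
            out += (T * alpha) * f        #   alpha-blend only this "quantile Gaussian"
            while next_boundary >= T_new and next_boundary > 0.0:
                next_boundary -= step     # one big drop may cross several boundaries
        T = T_new

    # Normalization step as described above: divide by (1 - remaining transmittance)
    # to account for the skipped tail (the paper's exact normalization may differ).
    absorbed = 1.0 - T
    if absorbed > 0.0:
        out /= absorbed
    return out
```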

Multiple analogies (3 ways):

  1. Crowd filter: In a parade, only the drummers change the rhythm a lot. Q‑Render listens at the drum moments (big transmittance drops) instead of every marcher.
  2. Rain gauge: Instead of measuring every raindrop, check fixed water levels in a tube. When the water crosses a mark, record that event. That’s sampling by quantiles.
  3. Bookmarks: Instead of reading every page, you jump to bookmarks placed at equal progress points. You still grasp the story arc with far fewer stops.

Before vs After:

  • Before: Volume rendering blends all intersecting Gaussians: complexity O(N·C) for N Gaussians and C‑dimensional features. Fast for RGB (C=3), too slow for CLIP (C=512).
  • After: Q‑Render scans once to find K events, then blends only K Gaussians: O(N + K·C). With small K (e.g., 10–50), it’s dramatically faster, enabling ~43.7× speedup for 512‑D rendering in tests while keeping accuracy.

Why it works (intuition, not equations):

  • The final pixel is the integral of contributions over how much light remains. If you sample at evenly spaced transmittance checkpoints (quantiles), you hit the key places where the ray actually changes. That’s a principled approximation, like a Riemann sum, whose error drops as you add more checkpoints (larger K).
  • Also, the 3D network tends to predict smooth features over space, so we don’t need dense per‑ray sampling to get stable feature maps.
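
For readers who do want the equation form, here is a rough sketch of that Riemann‑sum argument (an informal rendering of the intuition above, not the paper's exact derivation):

```latex
% Full alpha-blending: the per-Gaussian weights w_i = T_i \alpha_i sum to the absorbed light 1 - T_N.
\[
F \;=\; \sum_{i=1}^{N} T_i\,\alpha_i\, f_i
  \;\approx\; \int_{T_N}^{1} f(T)\,\mathrm{d}T
  \;\approx\; \frac{1 - T_N}{K}\sum_{k=1}^{K} f\!\left(T_{q_k}\right),
\qquad T_{q_k}\ \text{at evenly spaced transmittance quantiles.}
\]
% As with any equal-width Riemann sum, the approximation error shrinks on the order of 1/K.
```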

Building Blocks (each with a mini sandwich):

🍞 Transmittance‑domain sampling

  • What: Sample by equal steps in remaining light, not equal steps in distance or by sorting top‑K opacities.
  • How: Partition the [0,1] transmittance into K+1 equal bins; pick the Gaussian at each boundary crossing.
  • Why: Guarantees you catch the biggest influence changes without expensive sorting.
  • Anchor: Hiking by checking altitude every fixed drop (100 m) instead of every minute.

🍞 Sparse alpha‑blending

  • What: Blend only K chosen Gaussians.
  • How: Compute each picked Gaussian’s weight = current transmittance × opacity; accumulate weighted features.
  • Why: Saves the heavy C‑dimensional math on the many small contributors.
  • Anchor: In a recipe, measure the main spices; a pinch of salt doesn’t need a separate scale reading.

🍞 Normalization step

  • What: Correct for the skipped tail so totals match the full renderer’s scale.
  • How: Divide by (1 − remaining transmittance after the K picks).
  • Why: Keeps brightness/feature magnitude consistent with the full method.
  • Anchor: After sampling only some slices of a cake, scale your estimate to match the whole cake size.

🍞 GS‑Net (Gaussian Splatting Network)

  • Hook: Like a coach teaching every player (Gaussian) what role to play.
  • What: A 3D neural network that predicts high‑dimensional features for Gaussians across scenes.
  • How: Voxelize centers for efficiency, run a 3D backbone (MinkUNet or PTv3), then de‑voxelize back to per‑Gaussian features; Q‑Render turns these into 2D feature maps.
  • Why: Moves beyond per‑scene memorization; generalizes across scenes and learns smooth, robust features.
  • Anchor: Turn a messy orchestra into a trained band where every instrument (Gaussian) plays in harmony.
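
A minimal sketch of the voxelize → backbone → de‑voxelize flow described above, in plain NumPy. The 5 cm voxel size matches the grid discussed later, and `backbone` is a hypothetical placeholder for MinkUNet or PTv3 (their real APIs use sparse tensor libraries, not dense arrays):

```python
import numpy as np

def voxelize_centers(centers, voxel_size=0.05):
    """Snap Gaussian centers onto a sparse grid (e.g., 5 cm voxels).

    Returns the occupied voxel coordinates and, for each Gaussian, the index
    of the voxel it falls into (used to de-voxelize features afterwards).
    """
    cells = np.floor(centers / voxel_size).astype(np.int64)                 # (N, 3) integer cells
    unique_voxels, gaussian_to_voxel = np.unique(cells, axis=0, return_inverse=True)
    return unique_voxels, gaussian_to_voxel.reshape(-1)

def gsnet_forward(voxel_inputs, gaussian_to_voxel, backbone):
    """Predict a high-dimensional feature for every Gaussian.

    voxel_inputs:      (V, F_in) per-voxel input features
    gaussian_to_voxel: (N,) voxel index of each Gaussian
    backbone:          any (V, F_in) -> (V, 512) function, standing in for MinkUNet/PTv3
    """
    voxel_feats = backbone(voxel_inputs)           # (V, 512) voxel-level embeddings
    return voxel_feats[gaussian_to_voxel]          # de-voxelize: copy back to Gaussians

# Toy usage: a random "backbone" just to show the shapes involved.
rng = np.random.default_rng(0)
centers = rng.random((50_000, 3)) * 5.0            # Gaussians scattered in a 5 m room
voxels, g2v = voxelize_centers(centers)
toy_backbone = lambda x: x @ rng.standard_normal((x.shape[1], 512))
per_gaussian_feats = gsnet_forward(rng.standard_normal((len(voxels), 16)), g2v, toy_backbone)
print(per_gaussian_feats.shape)                    # (50000, 512), ready for Q-Render
```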

🍞 Aligning with CLIP via contrastive loss

  • What: Teach 3D features to match CLIP features of 2D masks.
  • How: Use Grounded‑SAM to get masks, extract CLIP 512‑D features, render 3D features with Q‑Render, and pull matches together with contrastive loss.
  • Why: Enables open‑vocabulary segmentation: type any word and find it in 3D.
  • Anchor: Practice flashcard matching so “toaster” in text finds the toaster region in the scene.
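
A minimal sketch of that alignment loss, assuming a batch of Q‑Render outputs paired one‑to‑one with CLIP features of the corresponding masks. This is a generic InfoNCE‑style cosine‑similarity loss; the paper's exact formulation (temperature, choice of negatives, pooling over mask pixels) may differ:

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(rendered, clip_targets, temperature=0.07):
    """Pull each rendered feature toward its matching CLIP feature,
    and push it away from the other masks in the batch.

    rendered:     (B, 512) features rendered by Q-Render for B masked regions
    clip_targets: (B, 512) CLIP features of the same B masks (row i matches row i)
    """
    rendered = F.normalize(rendered, dim=-1)
    clip_targets = F.normalize(clip_targets, dim=-1)
    logits = rendered @ clip_targets.t() / temperature   # (B, B) cosine similarities
    labels = torch.arange(rendered.shape[0], device=rendered.device)
    return F.cross_entropy(logits, labels)               # diagonal entries are the matches

# Gradients from this loss flow back through Q-Render into the 3D backbone (GS-Net).
```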

03Methodology

At a high level: Optimized 3D Gaussians → 3D neural network predicts per‑Gaussian features → Q‑Render along camera rays → 2D feature maps → Contrastive loss vs. CLIP → Backprop to improve 3D features.

Step‑by‑step (with why and examples):

  1. Inputs and preprocessing
  • What happens: Start with a scene represented by many Gaussians (position, size, rotation, color, opacity). Their centers are voxelized into a sparse 3D grid so a 3D backbone can process them efficiently.
  • Why it exists: Directly processing millions of overlapping Gaussians is slow. Voxelization groups them into a compact structure that 3D CNNs or sparse transformers handle well.
  • Example: A living room scene becomes a sparse 3D grid (say 5 cm voxels) containing only the spots where Gaussians live.
  2. 3D backbone predicts features (GS‑Net)
  • What happens: A 3D network (MinkUNet or PTv3) takes voxel features, learns spatial context, and outputs voxel‑level embeddings. These are de‑voxelized back to per‑Gaussian 512‑D features.
  • Why it exists: We want generalizable, smooth high‑dimensional features, not per‑scene codebooks. The 3D net learns across many scenes.
  • Example: The network learns that clusters near tabletops often align with “mug” or “plate,” producing feature vectors that match CLIP semantics.
  3. Collect 2D supervision from CLIP
  • What happens: For each training image, Grounded‑SAM proposes masks; CLIP’s vision encoder gives a 512‑D vector per mask.
  • Why it exists: We need rich, language‑aligned targets to teach the 3D features what “mug” or “toaster” looks like.
  • Example: A mask around a dark cup yields a 512‑D feature close to the text “dark cup.”
  4. Quantile Rendering (the heart)
  • What happens (algorithm in plain words):
    a. Rasterize/sort the Gaussians intersecting a pixel’s ray (like 3D‑GS does).
    b. Track transmittance T starting at 1. Split [0,1] into K+1 equal bins.
    c. Walk through the Gaussians in depth order. When T would cross the next bin boundary, mark this Gaussian as a quantile pick.
    d. For each pick, alpha‑blend its 512‑D feature with weight = current T × opacity.
    e. After K picks (or finishing), normalize by 1 − remaining T.
  • Why it exists: This avoids blending every Gaussian’s 512‑D vector. Instead, it samples the most influential moments along the ray. No expensive sorting of top‑K opacities is needed beyond the usual depth order, and complexity becomes O(N + K·C), not O(N·C).
  • Example (numbers): If a pixel sees N=200 Gaussians and C=512, volume rendering does 200×512 operations per pixel for feature blending. With K=40, Q‑Render does about 200 checks plus 40×512 blends—much less.
  5. Compute the contrastive loss
  • What happens: Compare the rendered feature vector for the masked pixel region with the CLIP feature of that mask using cosine similarity in a contrastive loss.
  • Why it exists: It pulls the 3D features toward the correct semantics and pushes away mismatches.
  • Example: The “toaster” mask’s CLIP vector gets higher similarity to the rendered features over training, while similarity to “plant” goes down.
  6. Backpropagation and training
  • What happens: Gradients flow from the contrastive loss through Q‑Render back into the 3D backbone, updating its weights to improve per‑Gaussian features.
  • Why it exists: End‑to‑end learning lets GS‑Net become a generalizable feature predictor for Gaussians.
  • Example: Over epochs, the network learns consistent features for mugs across different scenes.

Secret sauce (what’s clever):

  • Transmittance‑domain sampling: Sampling at equal steps of remaining light directly targets the places that change the pixel most—an efficient Riemann‑sum‑like approximation of full volume rendering with an error that shrinks as 1/K.
  • Sorting avoidance: Unlike top‑K selection (which sorts by importance and costs O(N log K) before blending), Q‑Render just watches T cross fixed boundaries—simple scans, fewer per‑ray costs.
  • Synergy with smooth features: Because the 3D network predicts spatially smooth embeddings, you don’t need dense per‑ray samples; sparse, well‑placed samples are enough.
  • Memory efficiency: No big per‑view caches; even with 512‑D, memory stays much lower than some competitors.

Concrete mini‑examples:

  • Single ray with K=3: Suppose T goes 1.0 → 0.75 → 0.50 → 0.25 → 0.05 as you pass five Gaussians. Q‑Render picks the ones near T=0.75, 0.50, 0.25 (three checkpoints), blends just those, and normalizes.
  • RGB test: Swapping full renderer with Q‑Render (K small) keeps PSNR close, showing the approximation is faithful even for color.

What breaks without each step:

  • No voxelization: The 3D network becomes too slow/memory heavy on raw Gaussians.
  • No quantile selection: You pay O(N·C) and lose real‑time speed for 512‑D features.
  • No normalization: The rendered magnitude drifts, hurting alignment with CLIP.
  • No contrastive loss: Features don’t learn to match language; open‑vocabulary fails.

Complexity and hyperparameters:

  • Complexity: Volume O(N·C). Top‑K O(N log K + K·C). Q‑Render O(N + K·C).
  • K choice: Accuracy improves and stabilizes around K≥10; best often near K=40 in tests. Too small K may under‑sample; too large K costs more time.
  • Grid size: Mid‑sized voxels (e.g., 5 cm) performed best; very small grids reduce receptive field overlap and hurt accuracy.
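
To make those complexity formulas tangible, here is the per‑pixel blend‑operation count for the worked numbers used earlier (N = 200 intersected Gaussians, C = 512 feature channels, K = 40 picks). The counts are rough and ignore constant factors and the color pass:

```python
import math

N, C, K = 200, 512, 40                       # numbers from the per-pixel example above

volume   = N * C                             # full volume rendering: blend every Gaussian
top_k    = round(N * math.log2(K)) + K * C   # rough sort-based top-K selection, then K blends
q_render = N + K * C                         # one cheap transmittance scan, then K blends

print(volume)    # 102400
print(top_k)     # 21544
print(q_render)  # 20680  -> roughly 5x fewer blend ops than full rendering for this pixel
```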

04Experiments & Results

The Test (what and why):

  • Benchmarks: ScanNetv2 (indoor) and LeRF‑OVS (scenes with open‑vocabulary labels).
  • Task: Open‑vocabulary 3D semantic segmentation. Type a category; the system labels matching parts in 3D.
  • Metrics: mIoU (mean Intersection over Union) and mAcc (mean Accuracy)—higher is better.
  • Why these: They measure how well predictions overlap with ground truth and how often labels are right across classes.
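
For reference, a small sketch of how mIoU and mAcc are typically computed from a predicted label map and a ground‑truth label map (standard definitions, not the benchmarks' official evaluation scripts):

```python
import numpy as np

def miou_macc(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape (e.g., per-point or per-pixel)."""
    ious, accs = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        if g.sum() == 0:                    # class absent from ground truth: skip it
            continue
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)          # Intersection over Union for class c
        accs.append(inter / g.sum())        # per-class accuracy (how often class c is found)
    return float(np.mean(ious)), float(np.mean(accs))
```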

The Competition (baselines):

  • LangSplat and OpenGaussian: compress features to low dimensions (3D or 6D), fast but lossy.
  • Dr.Splat: keeps 512‑D but uses heavier rendering and caching.

The Scoreboard (with context):

  • ScanNet (19 classes): GS‑Net (MinkUNet backbone) reaches mIoU ≈ 50.75% and mAcc ≈ 62.00%, more than doubling prior art (e.g., OpenGaussian at about 22.60% mIoU).
  • LeRF‑OVS: With 512‑D features, GS‑Net hits mIoU ≈ 45.85% and mAcc ≈ 56.9%, outperforming methods using compressed features.
  • Speed: Rendering 512‑D feature maps achieves up to ~43.7× speedup versus a common baseline implementation that loops 512 times. Frames‑per‑second (FPS) is real‑time with Q‑Render, while top‑K slows sharply as K grows.
  • Memory: Peak memory for a 100‑frame test is ~27.18 GB with Q‑Render (K=40) versus >61 GB for a competitor using per‑view caches at 512‑D.

Surprising/Notable findings:

  • Q‑Render can match or slightly beat full volume rendering in mIoU at certain K (e.g., K≈40). Likely because focusing on influential transmittance changes and the network’s smoothness reduce noise from tiny contributors.
  • PTv3 vs MinkUNet: PTv3 can overfit single scenes strongly, but MinkUNet generalizes better across many scenes. Grid size also matters: around 5 cm worked best; too tiny grids break receptive field sharing.
  • Top‑K sampling underperforms when K is small: It doesn’t track the original transmittance profile well, leading to bigger quality drops than Q‑Render.
  • Robustness: Adding mild noise to opacity barely hurts performance; extreme noise degrades results, as expected.
  • RGB check: Replacing the renderer with Q‑Render for color images slightly lowers PSNR but remains close, showing the approximation is faithful beyond features.

Ablations (what changed and why it matters):

  • Renderer swap: Volume vs Top‑K vs Q‑Render—Q‑Render gave the best speed‑accuracy trade‑off. Top‑K needs sorting and slows with larger K.
  • K sweep: Accuracy stabilizes around K≥10; best near K≈40 in tests. Mismatch between training K and inference K reduces performance—an argument for adaptive K in the future.
  • Grid size sweep: Mid‑range voxels (≈5 cm) gave highest mIoU; too small grids reduce effective context for the backbone and harm performance.
  • Input 3D‑GS quality: Better Gaussians (with depth supervision) improved downstream segmentation notably, highlighting dependence on input geometry.

Takeaway from numbers:

  • You can keep 512‑D CLIP richness and still render in real time by sampling at transmittance quantiles. That lifts both accuracy and practicality compared to compress‑then‑render pipelines.

05Discussion & Limitations

Limitations:

  • Fixed K: The best number of quantile picks depends on the scene/ray complexity, but K is fixed here. When inference K differs from training K, accuracy dips.
  • Input quality: If the 3D Gaussians are noisy or mis‑scaled, performance drops; better geometry (e.g., with depth supervision) helps a lot.
  • Backbone sensitivity: MinkUNet and PTv3 are sensitive to voxel grid size and can overfit. Point‑based backbones struggled with dense Gaussian clusters.

Required resources:

  • Training used powerful GPUs (e.g., A100‑80GB) with multi‑GPU batches. For high‑res, 512‑D, and many scenes, you need solid VRAM. Still, Q‑Render’s memory use is far lower than 512‑D baselines with big caches.

When NOT to use:

  • If you must exactly replicate the full volume renderer’s per‑ray behavior at very small K—then Q‑Render may approximate too coarsely.
  • If your Gaussians or opacities are extremely corrupted (far beyond realistic noise), transmittance checkpoints lose meaning.
  • If your pipeline already compresses features to tiny dimensions and values exact RGB matching over semantics, volume rendering might be sufficient.

Open questions:

  • Adaptive K: Can we choose K per ray efficiently, without a second pass or heavy heads? Early trials slowed FPS too much.
  • Gaussian‑aware backbones: Can we avoid voxelization or design operators that handle anisotropic Gaussians directly?
  • Generalizable 3D‑GS: If future methods skip per‑scene optimization entirely, how does GS‑Net change to plug in cleanly?
  • Better transmittance models: Could stratified or learned quantile placements improve accuracy without extra passes?

06Conclusion & Future Work

3‑Sentence Summary:

  • This paper introduces Quantile Rendering (Q‑Render), which samples at fixed steps of remaining light to pick only the most influential Gaussians along each ray.
  • Combined with GS‑Net, a 3D backbone that predicts 512‑D features for Gaussians, it aligns with CLIP via contrastive learning and renders feature maps efficiently.
  • The result is real‑time, high‑fidelity open‑vocabulary 3D segmentation that outperforms prior work while preserving rich semantic detail.

Main Achievement:

  • Turning high‑dimensional feature rendering from an O(N·C) bottleneck into an O(N + K·C) solution by sampling in the transmittance domain—achieving up to ~43.7× speedups for 512‑D with state‑of‑the‑art accuracy.

Future Directions:

  • Adaptive K selection with negligible overhead, Gaussian‑native network layers that reduce or remove voxelization, and tighter integration with generalizable 3D‑GS that avoid per‑scene optimization.

Why Remember This:

  • Q‑Render shows that “where you sample” matters as much as “how much you sample.” By watching transparency, not just space or top‑K, you keep the meaning (512‑D richness) and gain speed. That shift unlocks practical, language‑aware 3D understanding for AR, robotics, and content creation—fast, detailed, and ready for open‑ended queries.

Practical Applications

  • AR object finding: Say 'highlight the mug' and instantly see it outlined in your glasses.
  • Home robotics: Let a robot locate 'the red bowl on the counter' with open-vocabulary queries.
  • 3D content editing: Select 'sofa' in a scene and change its color or texture with one click.
  • Retail analytics: In a 3D store scan, count and track 'bottles of water' or 'cereal boxes' by name.
  • Safety checks: In factories, quickly detect 'no helmet' or 'blocked exit' in 3D scans.
  • Cultural heritage: Search museum scans for 'vase' or 'inscription' without handcrafted labels.
  • Game development: Rapidly tag and manipulate game assets in 3D scenes by natural language.
  • Digital twins: Query large facility models for 'valves' or 'pressure gauges' to assist maintenance.
  • Smart navigation: Help drones or robots understand 'doorway' or 'staircase' to plan motions.
  • Education: Interactive 3D lessons where students ask for 'planet model' or 'volcano vent' and see it highlighted.
#3D Gaussian Splatting#Quantile Rendering#Open-vocabulary segmentation#Transmittance#Alpha-blending#CLIP features#Contrastive learning#Sparse rendering#Voxelization#MinkUNet#Point Transformer v3#Feature rendering#High-dimensional embeddings#Riemann sum approximation#Real-time 3D understanding