MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
Key Summary
- MatSpray turns 2D guesses about what materials look like (color, shininess, metal) into a clean 3D model you can relight realistically.
- It uses 3D Gaussian Splatting to build the shape, and then sprays 2D material knowledge from diffusion models onto the 3D points.
- A tiny neural network called the Neural Merger uses softmax "voting" to pick the most consistent material values across all camera views.
- Two kinds of supervision keep things honest: matching the 2D material predictions and matching real photos using PBR rendering.
- This avoids baking shadows and highlights into the materials, so the object looks right under new lights.
- Compared to strong baselines (Extended R3DGS and IRGS), MatSpray relights more accurately and produces cleaner base color, roughness, and metallic maps.
- It is about 3.5× faster than IRGS on the Navi dataset while keeping high visual quality.
- It works well on shiny and metallic objects, where many methods struggle.
- Results are limited by the quality of the chosen 2D diffusion material predictor and by the underlying geometry.
- The method is plug-and-play: you can swap in better 2D diffusion material models as they arrive.
Why This Research Matters
MatSpray makes it much faster and easier to build 3D objects that look right under any lighting, which saves time and money in games, movies, AR, and VR. It converts powerful 2D material knowledge into a physically meaningful 3D form, avoiding the common problem of baked-in shadows or glares. This means creators can change the light or move objects into new scenes and still trust the results. The method is plug-and-play: as 2D diffusion material models improve, MatSpray instantly benefits. It is also efficient, running about 3.5× faster than a strong baseline, which matters for production pipelines. Finally, it handles shiny and metallic surfaces well, a notorious challenge for many systems. In short, it brings realism, speed, and reliability to digital asset creation.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how decorating a cake is way faster if you buy ready-made sprinkles instead of making each one from scratch? Game and film artists feel the same about building 3D objects that look real under any light: it takes forever to hand-craft every tiny material detail.
🥬 The Concept: Before this work, AI could build the shape of a 3D object and make it look good from a few photos, but it often mixed the lighting with the surface itself. That means shadows and highlights got "painted" into the texture, so the object looked wrong under new lights. Meanwhile, 2D diffusion models learned a lot about materials just from images and could predict PBR maps (base color, roughness, metallic), but only per image and not consistently across different views. How it works (the scene before MatSpray):
- Take many photos of an object.
- Reconstruct 3D appearance (NeRF or 3D Gaussians): great at matching the training photos.
- Try to edit lighting: it breaks because textures contain baked-in shadows and glares. Why it matters: Without separating materials from lighting, relighting looks fake. Artists spend hours fixing this, slowing movie/game production.
🍞 Anchor: Imagine photographing a shiny kettle. If the reflection of a window gets painted into the color map, turning off the window light won't remove the reflection. That's exactly the problem older methods had.
🍞 Hook: Imagine you and your friends guess what a mystery object is made of from different angles. Each friend has a good guess, but your answers don't match because everyone saw a different glare.
🥬 The Concept: 2D diffusion material predictors act like those friends: they can guess base color, roughness, and metallic from each view, but their guesses differ because lighting and viewing angle change. How it works:
- For each photo, a 2D model predicts PBR maps.
- Those maps are great "world knowledge," but inconsistent across photos.
- Simply averaging the maps in 3D blurs details and keeps lighting artifacts. Why it matters: If we can fuse those smart 2D guesses into 3D consistently, we get relightable materials fast.
🍞 Anchor: If three kids color the same car from different angles, one might draw a bright white stripe (a reflection), another won't. Averaging their drawings keeps a faint white stripe baked into the paint, not the real car color.
🍞 Hook: Think of building a clay statue from tiny soft beads. Each bead carries position, size, and color; together they make a smooth shape.
🥬 The Concept: 3D Gaussian Splatting represents a 3D object as a cloud of soft blobs (Gaussians) that can be rendered quickly and smoothly. How it works:
- Fit many 3D Gaussians to match photos.
- Each Gaussian has position, size, opacity, and a normal.
- Rendering blends their contributions along camera rays. Why it matters: This gives fast, detailed geometry and is a great canvas to attach per-point materials.
🍞 Anchor: It's like making a 3D teddy bear with thousands of fluffy pom-poms; you can color and shine each pom-pom.
🍞 Hook: A photographer asks, "How does light bounce on this surface?" to make photos look real.
🥬 The Concept: Physically Based Rendering (PBR) simulates real light behavior so materials look right under any light. How it works:
- Use material parameters: base color (albedo), roughness, and metallic.
- Combine them with surface normals and a light source (environment map).
- Compute the final pixel color as light reflects off the surface. Why it matters: Without PBR, shiny things don't shine correctly and matte things don't look matte.
🍞 Anchor: A chrome ball under sunlight looks totally different from a chalk ball; PBR captures that difference using those material knobs.
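To make those material knobs concrete, here is a minimal shading sketch: a simplified diffuse-plus-specular model, not MatSpray's actual renderer, and all function and variable names are illustrative.

```python
import numpy as np

def shade_point(base_color, roughness, metallic, normal, light_dir, view_dir, light_color):
    """Simplified PBR-style shading of one surface point: Lambertian diffuse plus a
    metal-aware specular lobe whose width grows with roughness. Illustrative only."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / np.linalg.norm(l + v)               # half vector between light and view

    n_dot_l = max(float(n @ l), 0.0)
    n_dot_h = max(float(n @ h), 0.0)

    diffuse = (1.0 - metallic) * base_color / np.pi   # metals have almost no diffuse term
    spec_tint = metallic * base_color + (1.0 - metallic) * 0.04
    shininess = 2.0 / max(roughness ** 2, 1e-4)       # rough surface -> wide, dull highlight
    specular = spec_tint * (n_dot_h ** shininess)

    return light_color * n_dot_l * (diffuse + specular)

# Chrome-like vs. chalk-like point under the same light (view mirrors the light about the normal):
n, l, v, light = np.array([0., 0., 1.]), np.array([0.2, 0.3, 1.]), np.array([-0.2, -0.3, 1.]), np.ones(3)
print(shade_point(np.array([0.9, 0.9, 0.9]), 0.05, 1.0, n, l, v, light))  # chrome: bright tinted highlight, no diffuse
print(shade_point(np.array([0.9, 0.9, 0.9]), 0.9, 0.0, n, l, v, light))   # chalk: mostly diffuse, tiny specular
```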
🍞 Hook: Imagine shining a laser pointer through fog and watching how the beam fades and spreads.
🥬 The Concept: Gaussian ray tracing tells us how camera rays pass through and interact with the 3D Gaussian blobs, letting us project 2D information onto 3D points accurately. How it works:
- Cast a ray from a camera pixel into 3D.
- For each Gaussian it passes, compute a contribution based on the Gaussian's opacity/shape.
- Use these hits to assign that pixel's material suggestion to the touched Gaussians. Why it matters: Without reliable 2D-to-3D assignment, materials land on the wrong spots or get missed.
🍞 Anchor: It's like using a spray can through a stencil: the paint (2D map) lands only where the beam (ray) hits the right dots (Gaussians).
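A rough sketch of that "spray through a stencil" step, assuming a simple closed-form peak contribution per Gaussian along the ray; the helper names and the 0.05 threshold are invented for illustration and simplify the paper's actual ray tracer.

```python
import numpy as np

def peak_gaussian_weight(origin, direction, mean, inv_cov, opacity):
    """Peak contribution of one 3D Gaussian to a camera ray (illustrative sketch).
    We take the ray position that maximizes the Gaussian density and evaluate the
    density there, scaled by the blob's opacity."""
    d = direction / np.linalg.norm(direction)
    t = ((mean - origin) @ inv_cov @ d) / (d @ inv_cov @ d)   # closed-form maximizer along the ray
    x = origin + max(t, 0.0) * d                              # stay in front of the camera
    diff = x - mean
    return opacity * np.exp(-0.5 * diff @ inv_cov @ diff)

def spray_pixel_material(origin, direction, gaussians, pixel_material, threshold=0.05):
    """Attach this pixel's 2D material suggestion to every Gaussian the ray really hits."""
    for g in gaussians:
        w = peak_gaussian_weight(origin, direction, g["mean"], g["inv_cov"], g["opacity"])
        if w > threshold:                                     # skip near-miss blobs
            g.setdefault("suggestions", []).append((w, pixel_material))

# One blob sitting on the ray, one far off to the side:
gaussians = [{"mean": np.array([0., 0., 5.]), "inv_cov": np.eye(3), "opacity": 0.9},
             {"mean": np.array([4., 0., 5.]), "inv_cov": np.eye(3), "opacity": 0.9}]
spray_pixel_material(np.zeros(3), np.array([0., 0., 1.]), gaussians, {"roughness": 0.2})
print([("suggestions" in g) for g in gaussians])   # [True, False]
```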
🍞 Hook: If five friends vote on a movie and you combine their votes smartly, you get a better choice than any single vote.
🥬 The Concept: Multi-view consistency means the object's material should be the same no matter which photo you look from. How it works:
- Gather material suggestions from all views for each 3D point.
- Combine them so that shadows/highlights donât sneak into the material.
- Prefer the most plausible, cross-view-agreeing values. Why it matters: Without consistency, materials flicker or change with the camera, which is bad for relighting.
🍞 Anchor: A wooden table should stay "wooden" from every angle; it shouldn't turn darker just because one photo had a shadow.
🍞 Hook: Imagine a music mixer that blends several singer tracks but only by adjusting volumes; it can't invent new notes.
🥬 The Concept: The Neural Merger is a tiny MLP that assigns softmax weights to per-view material suggestions and blends them, instead of inventing new values. How it works:
- Input: position encoding of each Gaussian + its per-view material suggestions.
- Output: one weight per view, normalized by softmax (weights sum to 1).
- Final material = weighted sum of suggestions. Why it matters: Without softmax, the network could "cheat," baking lighting into materials and destabilizing environment light estimation.
🍞 Anchor: If two views say "this spot is wood" and one view says "it's shiny from a glare," the Neural Merger upweights the consistent wood votes and downweights the glare.
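A toy numeric sketch of this voting idea (all numbers invented): softmax turns per-view scores into weights that sum to 1, so the fused value can only sit among the candidate suggestions.

```python
import numpy as np

# Per-view roughness suggestions for one Gaussian; the last view saw a glare.
suggestions = np.array([0.21, 0.19, 0.20, 0.60])
# Scores the merger might assign to each view (numbers invented for illustration).
scores = np.array([2.0, 2.1, 1.9, -1.5])

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: non-negative, sums to 1
fused = float(weights @ suggestions)              # blend of candidates, never a brand-new value

print(np.round(weights, 3), round(fused, 3))      # glare view gets ~1% weight; fused stays near 0.2
```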
02 Core Idea
🍞 Hook: Imagine repainting a toy airplane with perfect material paint: you sample paint from 2D photos (color, shininess, metalness), then carefully spray it onto the 3D model so it looks right in any room lighting.
🥬 The Concept: The key insight is to fuse 2D "world material knowledge" from diffusion models with a 3D Gaussian geometry via ray-traced projection and a softmax-based Neural Merger, then refine with PBR supervision so materials are consistent and relightable. How it works (at a glance):
- Predict per-view base color, roughness, and metallic maps using any 2D diffusion material model.
- Reconstruct the object as 3D Gaussians (shape + normals).
- Ray-trace from each image to assign its material suggestions onto intersected Gaussians.
- A tiny MLP with softmax weights blends these multi-view suggestions into one per-Gaussian material (no inventing, only choosing).
- Train with two signals: match the 2D material maps and match the real photos via PBR rendering under an optimizable environment map. Why it matters: It keeps the wisdom of the 2D model, fixes cross-view disagreements, and avoids baking light into the materials, so relighting looks real.
🍞 Anchor: It's like having several smart art teachers (2D models) suggest paint mixes for each spot on a sculpture, while one fair judge (the Neural Merger) decides the final paint that works from all sides.
Multiple analogies:
- Paint Sprayer Analogy: 2D maps are cans of specialized paint (color/roughness/metal). Gaussian ray tracing is your aiming laser, and the Neural Merger is your nozzle that balances paint from different cans based on confidence.
- Orchestra Analogy: Each camera view plays its material "note." The Neural Merger is the conductor using softmax to balance volumes; PBR supervision is the music score ensuring the final sound matches physics.
- Voting System Analogy: Every view casts a vote about a point's material. The Neural Merger tallies votes with softmax probabilities, preferring consistent voters and ignoring outliers.
Before vs After:
- Before: 3D reconstructions looked fine under training lights but failed under new lights because materials had baked-in shadows and glares; 2D predictors were strong but inconsistent across views.
- After: Materials are clean, consistent, and relightable; metallic regions are correctly identified; environment lighting can be optimized stably because materials aren't cheating.
Why it works (intuition):
- Constraint beats freedom: Letting a network invent any material value is risky; it can overfit to lighting. Constraining it to interpolate only among 2D suggestions keeps results plausible and stabilizes training.
- Two complementary teachers: The 2D loss preserves the diffusion prior's realism, while the PBR photometric loss aligns everything with real images and fixes small diffusion errors.
- Geometry-aware mapping: Gaussian ray tracing ensures the right 2D suggestions land on the right 3D points; supersampling avoids missing tiny Gaussians.
Building blocks (with mini sandwiches):
- 🍞 Hook: Think of sticker labels for a model kit. 🥬 The Concept: 2D Diffusion Material Predictions are per-image guesses of base color, roughness, and metallic. How: a trained diffusion model outputs these maps for each photo. Why: provides strong priors, but they're view-inconsistent. 🍞 Anchor: One view labels a spoon as very shiny; another says medium; both are plausible per photo.
- 🍞 Hook: Sculpting from soft beads. 🥬 The Concept: 3D Gaussian Splatting represents the object as many soft blobs with normals. How: optimize blobs to match photos. Why: fast, detailed canvas for per-point materials. 🍞 Anchor: A cat statue made of puffy dots you can color.
- 🍞 Hook: A laser pointer through mist. 🥬 The Concept: Gaussian Ray Tracing assigns 2D material suggestions to the Gaussians a pixel-ray touches. How: trace rays, compute contributions, take robust medians per Gaussian and view. Why: without accurate assignment, materials blur or land wrong. 🍞 Anchor: Spray paint hits only the dots your laser passes.
- 🍞 Hook: Fair judge blending votes. 🥬 The Concept: Neural Merger (softmax). How: MLP outputs per-view weights, softmax normalizes them, final material is weighted sum; separate MLPs per channel help disentanglement. Why: avoids inventing values and prevents light baking. 🍞 Anchor: The judge can't invent a new color, only choose a mix of the candidates.
- 🍞 Hook: Stage lighting test. 🥬 The Concept: Dual Supervision. How: (1) L1 loss to 2D maps; (2) PBR render-and-compare to photos with SSIM/L1 while optimizing an environment map. Why: keeps priors but corrects tone/lighting mismatches. 🍞 Anchor: You compare your stage set both to the designer's sketch (2D prior) and to a rehearsal photo (PBR render).
03 Methodology
At a high level: Input photos → 2D diffusion predicts PBR maps per view → 3D Gaussian Splatting builds shape/normals → Gaussian ray tracing lifts 2D maps onto Gaussians → Neural Merger fuses per-view suggestions with softmax → Dual supervision refines materials and environment light → Output: relightable 3D materials.
Step 1: 2D Diffusion Material Prediction
- What happens: For each training image, a diffusion-based material model (e.g., DiffusionRenderer) predicts base color (3 channels), roughness (1), and metallic (1). Short video batches improve local consistency.
- Why this step exists: It injects âworld material knowledgeâ learned from huge 2D datasets, giving strong priors even where lighting is tricky.
- Example: A teakwood table photo gets a brownish base color map, medium roughness, near-zero metallic; a chrome kettle gets near-one metallic in metal areas and low roughness.
Step 2: 3D Gaussian Splatting Geometry and Normals
- What happens: Build a 3D Gaussian cloud that explains the input photos; store positions, shapes, opacities, and normals. For specular scenes, predicted normals can guide geometry to avoid holes.
- Why this step exists: You need a concrete 3D canvas with normals to attach materials for PBR shading.
- Example: A birdhouse becomes tens of thousands of Gaussians capturing edges, roof curves, and smooth walls.
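As a sketch, the per-Gaussian record that later steps fill in might look like the following; field names and layout are assumptions for illustration, not the authors' data structures.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MaterialGaussian:
    """One soft 3D blob plus the per-point attributes later steps attach to it.
    Field names are illustrative, not the authors' actual data layout."""
    position: np.ndarray            # (3,) center in world space
    scale: np.ndarray               # (3,) ellipsoid axis lengths
    rotation: np.ndarray            # (4,) quaternion orientation
    opacity: float                  # how solid the blob is
    normal: np.ndarray              # (3,) surface normal used for PBR shading
    # Step 3 fills this: view_id -> {"base_color": ..., "roughness": ..., "metallic": ...}
    view_suggestions: dict = field(default_factory=dict)
    # Step 4 fills these with the fused, relightable material:
    base_color: np.ndarray = None
    roughness: float = None
    metallic: float = None

g = MaterialGaussian(position=np.zeros(3), scale=np.full(3, 0.01),
                     rotation=np.array([1., 0., 0., 0.]), opacity=0.8, normal=np.array([0., 0., 1.]))
```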
Step 3: 2D-to-3D Material Lifting via Gaussian Ray Tracing
- What happens: For each image pixel, cast a ray and accumulate which Gaussians it intersects. Assign that pixel's material values to those Gaussians. Within each Gaussian's footprint per view, take a robust median to reduce outliers. Remove Gaussians never hit; use grid supersampling (e.g., 16×16 rays per pixel) so tiny Gaussians aren't missed.
- Why this step exists: It's the projector that transfers 2D guesses to the correct 3D spots, view by view.
- Example: The shiny streak seen in one photo affects only the Gaussians that ray truly hits on the kettle's curved body, not the handle or spout.
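A small sketch of the robust per-view aggregation described above, with invented helper names; the 16×16 supersampling happens upstream, and this function only collapses the resulting hits.

```python
import numpy as np

def aggregate_view_hits(hits_per_gaussian):
    """Collapse all pixel-ray hits of one view on each Gaussian into a single robust value.
    `hits_per_gaussian` maps a Gaussian id to the material values whose supersampled rays
    (e.g., a 16x16 grid per pixel) intersected it in this view. The per-channel median keeps
    outlier pixels (glares, edge bleed) from dominating. Illustrative sketch only."""
    merged = {}
    for gaussian_id, values in hits_per_gaussian.items():
        if values:                                        # Gaussians never hit are pruned elsewhere
            merged[gaussian_id] = np.median(np.stack(values), axis=0)
    return merged

# One Gaussian hit by four supersampled rays; one ray caught a bright glare.
hits = {7: [np.array([0.31, 0.20, 0.11]), np.array([0.30, 0.21, 0.12]),
            np.array([0.29, 0.20, 0.10]), np.array([0.85, 0.80, 0.75])]}
print(aggregate_view_hits(hits))   # median stays close to the un-glared base color
```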
Step 4: Neural Merger (Softmax Fusion per Gaussian)
- What happens: For each Gaussian and each material channel separately (base color, roughness, metallic), feed the Gaussian's position encoding plus all its per-view suggestions into a small MLP. The MLP outputs unnormalized weights; a softmax turns them into probabilities summing to 1. The final material equals the weighted sum of suggestions.
- Why this step exists: It enforces multi-view consistency by blending only among candidate values, preventing the network from inventing new ones and from baking lighting into materials.
- Example: If three views say roughness ≈ 0.2 and one view (with a glare) says 0.6, the Neural Merger downweights 0.6 and picks ≈ 0.2.
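A minimal PyTorch sketch of such a per-channel merger; the hidden size and positional-encoding dimension are assumptions, but it shows the key constraint that the output is a softmax-weighted blend of the input suggestions.

```python
import torch
import torch.nn as nn

class NeuralMerger(nn.Module):
    """Softmax-restricted fusion of per-view material suggestions (illustrative sketch).
    The output is a convex combination of the inputs, so the merger can only choose
    among candidates, never invent new material values."""
    def __init__(self, num_views: int, pos_enc_dim: int, channels: int, hidden: int = 64):
        super().__init__()
        in_dim = pos_enc_dim + num_views * channels
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_views),          # one unnormalized score per view
        )

    def forward(self, pos_enc, suggestions):
        # pos_enc:     (G, pos_enc_dim)            positional encoding of each Gaussian
        # suggestions: (G, num_views, channels)    per-view candidates for this material channel
        g = suggestions.shape[0]
        x = torch.cat([pos_enc, suggestions.reshape(g, -1)], dim=-1)
        weights = torch.softmax(self.mlp(x), dim=-1)               # (G, num_views), rows sum to 1
        return (weights.unsqueeze(-1) * suggestions).sum(dim=1)    # (G, channels) fused material

# One merger per material channel, e.g. roughness (channels=1) for 4 views and 100 Gaussians:
merger = NeuralMerger(num_views=4, pos_enc_dim=32, channels=1)
fused = merger(torch.randn(100, 32), torch.rand(100, 4, 1))
```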
Step 5: Dual Supervision and Refinement
- What happens: Render per-Gaussian materials into 2D material maps and use an L1 loss to match the diffusion-predicted maps (only the Neural Merger is updated by this loss). Also perform deferred PBR shading using the materials and normals to produce images; compare them to ground-truth photos using a blended L1 and SSIM loss (e.g., λ=0.8). Jointly refine the Neural Merger and an environment map.
- Why this step exists: The 2D loss keeps the solution anchored to the diffusion prior; the PBR loss corrects for tone mapping or small 2D errors and ensures the final look matches real images.
- Example: If DiffusionRenderer's base color is a bit too dark due to tone mapping, the PBR loss nudges it brighter so renders align with the photos.
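A sketch of the two losses, using a simplified global SSIM in place of the usual windowed version; only the λ=0.8 L1/SSIM blend comes from the text, while the relative weighting of the two losses (here: equal) and which term λ multiplies are assumptions.

```python
import torch

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Whole-image SSIM (the standard metric uses local windows; this is a simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def dual_supervision_loss(rendered_materials, diffusion_maps, pbr_render, photo, lam=0.8):
    """Sketch of the two training signals described above (illustrative, not the authors' code)."""
    prior_loss = (rendered_materials - diffusion_maps).abs().mean()        # stay close to the 2D prior
    photometric = lam * (pbr_render - photo).abs().mean() \
                  + (1.0 - lam) * (1.0 - global_ssim(pbr_render, photo))   # match the real photo
    return prior_loss + photometric

# Toy (C, H, W) tensors standing in for rendered maps, diffusion maps, a PBR render, and a photo:
loss = dual_supervision_loss(torch.rand(3, 8, 8), torch.rand(3, 8, 8),
                             torch.rand(3, 8, 8), torch.rand(3, 8, 8))
```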
Step 6: Output and Relighting
- What happens: Export the 3D Gaussians with clean base color, roughness, and metallic per Gaussian (a relightable asset). Under new HDR environment maps, the object responds realistically.
- Why this step exists: The goal is not just to match training photos but to look right under any new light.
- Example: Put your reconstructed metal airplane in a sunset HDR: warm highlights slide correctly over its metallic body; matte decals stay dull.
The Secret Sauce:
- Softmax-restricted fusion: By only interpolating among per-view candidates, the Neural Merger avoids cheating. Without softmax, the MLP can create values that absorb shadows/highlights, causing unstable environment-light optimization and non-physical materials.
- Geometry-aware projection: Median aggregation per Gaussian/view and dense supersampling keep assignments robust, preserving detail and preventing holes.
- Two teachers, one student: 2D prior supervision + PBR photometric supervision together guide the Neural Merger to be both plausible and physically faithful.
04 Experiments & Results
🍞 Hook: Imagine three soccer teams competing: which one scores more and plays cleaner? We judge MatSpray the same way: how well it relights, how accurate its material maps are, and how fast it runs.
🥬 The Concept: The tests measure material-map accuracy (PSNR, SSIM, LPIPS), relighting quality under new environment maps, and runtime. Comparisons are against Extended R3DGS (supports metallic) and IRGS. How it works:
- Synthetic set: 17 objects with ground-truth materials; train on 100 images, test on 200 per object.
- Real set: Navi objects with ~27 images each.
- Metrics: PSNR (higher is better), SSIM (higher is better), LPIPS (lower is better). Runtime on an NVIDIA RTX 4090. Why it matters: Numbers with baselines show if MatSpray is really better, not just pretty in a few pictures.
🍞 Anchor: It's like saying "We scored 3-1 (PSNR up), kept clean passes (SSIM up), and made fewer mistakes (LPIPS down), and we did it faster."
The Competition and Scoreboard:
- Relighting (synthetic): MatSpray PSNR ≈ 27.28 vs. Extended R3DGS ≈ 25.48 and IRGS ≈ 24.41; SSIM 0.897 vs. 0.875 vs. 0.850; LPIPS 0.080 vs. 0.094 vs. 0.166. Context: That's like getting an A when others get a B or B-, especially noticeable on shiny objects.
- Base color (synthetic): MatSpray PSNR ≈ 21.34, SSIM 0.873, LPIPS 0.125, the best overall, with notably cleaner removal of baked-in shadows vs. baselines.
- Roughness (synthetic): IRGS has slightly higher PSNR (≈16.18 vs. 15.33), but MatSpray wins SSIM (0.820 vs. 0.744) and LPIPS (0.181 vs. 0.192). Translation: MatSpray preserves structures and looks better perceptually, even if a pixelwise average error is a bit higher.
- Metallic (synthetic): MatSpray's predictions align closely with ground truth; for truly non-metal objects, it correctly outputs zeros, which yields infinite PSNR (those are excluded from the reported average). Extended R3DGS lags; IRGS doesn't predict metallic.
Qualitative Highlights:
- Shiny/metallic objects: MatSpray avoids over-bright, over-shiny artifacts seen in Extended R3DGS and prevents washed-out, oversmoothed geometry common in IRGS.
- Material maps: MatSpray's base color is sharp and free of shadows; roughness/metallic are stable across views compared to per-view 2D predictions.
Runtime:
- Average total: ~1,488 s (~25 min) for MatSpray vs. ~5,347 s (~89 min) for IRGS, about 3.5× faster.
- Breakdown (MatSpray): diffusion predictions ~112 s, Gaussian Splatting ~131 s, normal guidance (R3DGS) ~270 s, material optimization ~975 s.
Surprising/Notable Findings:
- Infinite PSNR on metallic in non-metal scenes indicates robust zero-metal prediction, a strong sign of correct disentanglement.
- A small softmax detail makes a big difference: Without the softmax in the Neural Merger, lighting leaks into materials and quality drops (worse LPIPS/SSIM and visible artifacts).
- Tone mapping in the 2D predictor can darken base color predictions; the PBR supervision helps correct this mismatch during refinement.
Mini metric sandwiches:
- 🍞 Hook: You know how clearer photos look sharper and less noisy? 🥬 The Concept: PSNR measures signal vs. noise; higher is better. How: compute error vs. ground truth and convert to decibels. Why: shows overall fidelity. 🍞 Anchor: A PSNR jump from ~24 to ~27 is a visible improvement.
- 🍞 Hook: Two puzzle pictures with the same structure feel similar. 🥬 The Concept: SSIM measures structural similarity; edges/textures matter. Why: captures perceptual structure. 🍞 Anchor: 0.897 vs. 0.850 means details are better preserved.
- 🍞 Hook: Human eyes spot weird textures quickly. 🥬 The Concept: LPIPS uses deep features to judge perceptual difference; lower is better. Why: correlates with what looks right. 🍞 Anchor: 0.080 vs. 0.166 is a significant perceptual win.
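For reference, PSNR is simple enough to sketch directly; this is the generic definition, not this paper's evaluation code.

```python
import numpy as np

def psnr(prediction, ground_truth, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher means closer to ground truth."""
    mse = np.mean((prediction - ground_truth) ** 2)
    if mse == 0:
        return float("inf")   # perfect match, e.g. predicting exactly zero metallic on a non-metal object
    return 10.0 * np.log10(max_val ** 2 / mse)

print(psnr(np.full((4, 4), 0.1), np.zeros((4, 4))))   # 20 dB: an RMS error of 0.1 on a [0, 1] scale
# Halving the RMS error adds about 6 dB; the ~24 -> ~27 dB gain above corresponds to roughly a 30% drop in RMS error.
```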
05 Discussion & Limitations
Limitations:
- Dependent on 2D predictor quality: If the diffusion model outputs biased or tone-mapped materials, the system must correct them during refinement; severe errors can limit final quality.
- Geometry sensitivity: If underlying Gaussian geometry/normals are inconsistent (e.g., tricky specular scenes), materials can inherit issues. Supersampling and normal guidance help but don't solve everything.
- Tiny/flat Gaussians: Very small or nearly flat Gaussians may be missed during ray tracing, causing sparse/missing assignments without sufficient supersampling.
Required resources:
- GPU with solid memory (e.g., RTX 4090 class used in experiments), plus standard PyTorch/C++/OptiX stack.
- Per-scene optimization time in the tens of minutes; diffusion prediction time proportional to number of images.
When not to use:
- Extremely sparse views or heavy occlusions where 2D predictions are inconsistent and geometry is weak.
- Scenes with severe motion or changing illumination between captures, which break multi-view assumptions.
- If you need perfect physical BRDFs beyond base color/roughness/metallic (e.g., anisotropy, subsurface) without extending the model.
Open questions and future avenues:
- End-to-end alignment: Could a joint training loop co-adapt the 2D predictor and 3D fusion to remove tone mapping gaps entirely?
- Better roughness estimation: Roughness remains tough; can priors or cross-view highlight cues improve it further?
- Smarter assignment: A projection transformer or learned correspondence module might robustly handle missed/small Gaussians.
- Richer materials: Extending beyond base color/roughness/metallic (clearcoat, anisotropy, subsurface) while keeping speed.
- Data scaling: How do results change with more/fewer views, or with active view selection strategies?
06 Conclusion & Future Work
Three-sentence summary: MatSpray fuses per-view 2D diffusion material predictions with a 3D Gaussian model using ray-traced projection and a softmax-based Neural Merger, then refines them with PBR supervision. This produces clean, multi-view-consistent base color, roughness, and metallic maps that relight realistically and avoid baked-in lighting. It outperforms strong baselines in quality and speed, especially on shiny and metallic objects.
Main achievement: Showing that constraining a tiny network to blend (not invent) among 2D material suggestions, combined with geometry-aware projection and dual supervision, reliably turns 2D world knowledge into high-quality, relightable 3D materials.
Future directions: Jointly address tone mapping mismatches, improve roughness estimation, add more material parameters, and explore learned projection/assignment. Use the clean geometry-material link for segmentation and intuitive editing interfaces.
Why remember this: It's a practical, plug-and-play bridge between powerful 2D diffusion priors and fast 3D Gaussian representations. By turning many good per-view guesses into one consistent 3D truth, MatSpray makes high-quality relightable assets faster and more reliable for real production.
Practical Applications
- Speed up game and film asset creation by generating relightable materials from casual multi-view photos.
- Create accurate product models for e-commerce that look correct under different showroom lights.
- Build AR try-on or visualization assets (furniture, appliances, décor) that adapt realistically to a user's room lighting.
- Digitize museum artifacts or props with faithful materials for virtual exhibits and education.
- Generate training data for robotics or vision by producing physically consistent objects under varied illumination.
- Support VFX relighting: match on-set props with CG inserts by aligning materials and environment maps.
- Enable quick look-dev iterations: swap HDR environment maps and verify how materials respond without re-authoring.
- Assist 3D printing previews with realistic surface finishes before manufacturing.
- Support virtual prototyping in industrial design by testing different material finishes on the same 3D geometry.
- Provide a foundation for part segmentation and material-aware editing (e.g., repaint only metal parts) in future tools.