
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Intermediate
Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin et al. · 12/31/2025
arXiv · PDF

Key Summary

  ‱ GaMO is a new way to rebuild 3D scenes from just a few photos by expanding each photo’s edges (outpainting) instead of inventing whole new camera views.
  ‱ By staying at the original camera spots and only widening the field of view, GaMO keeps the geometry consistent and avoids ghosting and holes.
  ‱ It uses a smart, zero-shot multi-view diffusion process guided by rough 3D cues (from 3D Gaussian Splatting) to fill in missing areas around each image.
  ‱ A special mask guides where to add new content, and the method blends that content in steps while resampling noise to make seams smooth.
  ‱ GaMO runs fast: under 10 minutes end-to-end for typical 6-view indoor scenes, about 25× faster than strong diffusion baselines.
  ‱ On Replica and ScanNet++, it beats prior methods on PSNR, SSIM, and LPIPS, and produces cleaner, more complete reconstructions.
  ‱ Adding more invented novel views can actually hurt quality; GaMO’s outpainting consistently improves geometry and appearance.
  ‱ No fine-tuning of the diffusion backbone is needed; the method works in a zero-shot way using pre-trained multi-view diffusion.
  ‱ Limitations include not recovering parts that no camera ever saw and sensitivity to poorly spaced input views.

Why This Research Matters

GaMO makes high-quality 3D capture possible from just a few photos, cutting cost and time for real-world scanning. It produces cleaner virtual tours and AR/VR scenes without distracting holes, ghosting, or warped walls. Faster reconstructions help real estate, construction, and retail teams update digital twins quickly. Robots and drones benefit from more trustworthy 3D maps built from limited viewpoints. Creators can design immersive spaces without spending hours planning and rendering extra views. Overall, it trades risky hallucination for careful, geometry-aware expansion, leading to dependable results.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how finishing a big jigsaw puzzle is easy if you have most of the pieces, but really tough if you only have a few? 3D reconstruction from photos is just like that: the more views you have, the easier it is to build the full 3D scene.

đŸ„Ź The Concept (3D Reconstruction): What it is: Turning several 2D photos of a scene into a complete 3D model you can look around. How it works (recipe): 1) Take photos from different angles, 2) figure out where the cameras were, 3) estimate where objects are in 3D, and 4) render new views. Why it matters: Without reliable 3D reconstructions, virtual tours look broken, AR apps misplace objects, and robots can’t navigate safely. 🍞 Anchor: Think of a real-estate virtual tour. If the 3D is wrong, walls might float and doors won’t line up.

🍞 Hook: Imagine you took only three photos of your room—front, left, and right. There will be big gaps (like behind the couch) that none of the photos see.

đŸ„Ź The Problem: With only a few input views, classic methods leave holes (missing areas), ghosting (double edges), and bent or inconsistent geometry. Researchers tried three main ideas: regularization (telling the model to keep things smooth), semantic or large-model priors (using general knowledge about scenes), and geometric constraints (forcing math rules about cameras and rays). These helped, but couldn’t reliably fill parts no photo covered. Why it matters: Apps break when ceilings have holes, shelves float, or edges shimmer as you move the camera. 🍞 Anchor: Watching a tour where a table duplicates as you walk around is distracting and untrustworthy.

🍞 Hook: You know how people sometimes draw extra frames between two movie frames to make motion smoother? Diffusion models can do a similar job by inventing new images from in-between or new viewpoints.

đŸ„Ź The Concept (Diffusion for Novel Views): What it is: Use diffusion models to generate brand-new camera views to add more training data. How it works: 1) Sample new camera poses, 2) generate images from those poses with a multi-view diffusion model, 3) train the 3D model on both real and generated views. Why it matters: More angles should mean better coverage. But if generated views disagree with the real ones, training gets confused and quality drops. 🍞 Anchor: In practice, adding many invented views often made reconstructions worse—more blur and misalignments showed up.

🍞 Hook: Imagine extending the edges of your photo by painting more of the same room onto the sides, like zooming out without moving the camera.

đŸ„Ź The Gap (Before vs. After): Before: Methods generated totally new viewpoints, leading to inconsistencies and long, complex pipelines (planning camera paths, many frames). After (this paper’s idea): Keep the same camera but widen its field of view (FOV). That preserves alignment with what’s already there and fills the missing edges. Why it matters: You get broader coverage without breaking geometry or spending hours on planning and rendering. 🍞 Anchor: Instead of inventing a new camera angle behind the couch, GaMO extends each original photo outward to include the edges near the couch, which is easier to trust and align.

🍞 Hook: Think of putting masking tape where the image is missing and carefully airbrushing new content that matches the room.

đŸ„Ź Failed Attempts vs. What’s Missing: Regularization can oversmooth, semantic priors can hallucinate wrong shapes, and novel views can fight with each other. What was missing was a way to expand what we already trust (the original photos) while staying consistent with rough 3D structure. 🍞 Anchor: GaMO’s outpainting approach chooses to carefully grow each trusted view outward—and uses a rough 3D model as guardrails—so the added pixels ‘snap’ to the right places.

🍞 Hook: Why should anyone care? Because this is how we get better AR/VR, safer robot navigation, quicker property scans, and more reliable digital twins.

đŸ„Ź Real Stakes: Faster, cleaner, and more consistent reconstructions from fewer photos reduce time and cost in real-world scanning (homes, construction, retail) and make immersive experiences feel solid. Without this, you get slow, glitchy, and untrustworthy 3D. 🍞 Anchor: A realtor can scan a room with just a handful of shots and still get a convincing virtual tour in minutes, not hours.

02Core Idea

🍞 Hook: You know how it’s easier to color outside the lines by gently extending what’s already there, rather than starting a brand-new drawing from scratch?

đŸ„Ź Aha! Moment (one sentence): Instead of inventing whole new viewpoints, widen each original photo’s field of view with geometry-aware, multi-view diffusion outpainting so everything stays aligned and complete.

Multiple Analogies:

  1. Panorama edges: Stitching a panorama works best when you extend from what you’ve already captured; GaMO extends each original view outward.
  2. Jigsaw borders: First finish the edges of the puzzle; outpaint around each view to form a sturdy frame.
  3. Stage lights: Keep the spotlight in the same place but widen the beam to reveal more of the scene without moving the light.

Before vs. After:

  ‱ Before: Generate many novel views from new camera poses; risk disagreements among views; takes time to plan and render; can cause ghosting and bent geometry.
  ‱ After: Outpaint in place from existing cameras; geometry stays consistent; fewer steps and faster; fewer holes and edge blur.

Why It Works (intuition):

  ‱ Staying put avoids the hardest alignment: you don’t ask the model to match far-away camera positions.
  ‱ A rough 3D ‘skeleton’ (coarse 3D Gaussian Splatting render and opacity mask) tells the diffusion model where content should exist and where it’s missing.
  ‱ A mask blends generated pixels into the old ones, and a gentle schedule with noise resampling smooths seams so the new edges look natural.
  ‱ Multi-view conditioning makes different input photos ‘agree’ as they guide the outpainting, locking in cross-view consistency.

Building Blocks (each in Sandwich style):

🍞 You know how a photo editor can tell which areas are empty or transparent? đŸ„Ź Opacity Masking: What it is: A map that says where the current 3D guess is weak or missing. How it works: Render a coarse 3D model, compute how much stuff blocks each pixel (opacity), and mark low-opacity zones as ‘needs filling’. Why it matters: Without it, the model might change good pixels or miss the empty edges. 🍞 Anchor: The mask flags the blank margins that need new room content.
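To make the “needs filling” idea concrete, here is a minimal sketch of deriving such a mask from a coarse render’s accumulated opacity (the helper name and tensor layout are ours; the ~0.6 threshold comes from the recipe recap later in the methodology):

```python
import torch

def build_outpaint_mask(opacity: torch.Tensor, threshold: float = 0.6) -> torch.Tensor:
    """Mark pixels where the coarse 3DGS render is too thin to trust.

    opacity: (H, W) accumulated alpha from the coarse wider-FOV render,
             near 1.0 where Gaussians fully cover the pixel, near 0.0 where nothing was hit.
    Returns a binary mask that is 1 where new content must be outpainted.
    """
    needs_filling = (opacity < threshold).float()  # low opacity => missing content
    return needs_filling
```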

🍞 Imagine using several friends’ photos of the same room to agree on where furniture is. đŸ„Ź Multi-view Conditioning: What it is: Feeding the diffusion model signals from all input views and camera rays so it knows how views relate in 3D. How it works: Encode camera rays (PlĂŒcker), warp features toward the widened view, and keep the original center pixels untouched. Why it matters: Without it, each outpainted view might tell a different story, causing cross-view disagreements later. 🍞 Anchor: It’s like aligning everyone’s sketches before painting the borders.
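A small sketch of how per-pixel PlĂŒcker ray embeddings (direction plus moment) can be computed from camera intrinsics and pose; the exact encoding the paper feeds to its diffusion backbone may differ:

```python
import torch

def plucker_rays(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel PlĂŒcker ray embedding for one (possibly widened) camera view.

    K:   (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns an (H, W, 6) tensor usable as a pose condition for a multi-view model.
    """
    # Pixel grid -> camera-space ray directions
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs_cam = torch.stack([(u - K[0, 2]) / K[0, 0],
                            (v - K[1, 2]) / K[1, 1],
                            torch.ones_like(u)], dim=-1)          # (H, W, 3)
    # Rotate into world space and normalize
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)                           # camera center per pixel
    moment = torch.cross(origin, dirs, dim=-1)                    # o x d
    return torch.cat([dirs, moment], dim=-1)                      # (H, W, 6)
```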

🍞 Think of a faint pencil sketch under your painting that keeps your brushstrokes on track. đŸ„Ź Coarse 3D Prior (from 3D Gaussian Splatting): What it is: A quick, rough 3D rendering of the scene used as a guide. How it works: Build a coarse 3D model from sparse photos, render a wider-FOV image and an opacity map. Why it matters: Without a skeleton, the outpainted pixels could look plausible yet float in the wrong place. 🍞 Anchor: The coarse render hints where walls and tables belong as you extend the image.
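One simple way to “widen the beam” is to shrink the focal length while keeping the camera pose and resolution fixed, then re-render the coarse 3DGS from that widened camera. Assuming the recipe’s FOV scale acts on the focal length (an assumption on our part), a sketch looks like this:

```python
import numpy as np

def widen_intrinsics(K: np.ndarray, fov_scale: float = 0.6) -> np.ndarray:
    """Widen a camera's field of view in place (same pose, same image size).

    Scaling the focal lengths down enlarges the FOV, so the coarse 3DGS model can be
    re-rendered from the original camera position but with a wider view, exposing the
    margins that need outpainting.
    """
    K_wide = K.copy()
    K_wide[0, 0] *= fov_scale  # fx
    K_wide[1, 1] *= fov_scale  # fy
    return K_wide
```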

🍞 Picture taping a stencil over the edges you’re adding and lifting it a bit at a time. đŸ„Ź Mask Latent Blending: What it is: During denoising, combine the generated latent with the coarse latent only in the masked zones. How it works: At selected steps, add matching noise to the coarse latent and blend by a hard mask. Why it matters: Without blending, the model can drift and create seams or misalignments. 🍞 Anchor: The stencil ensures the new paint only goes where needed, at the right times.

🍞 Imagine sanding the border between old and new paint so you can’t see the seam. đŸ„Ź Iterative Mask Scheduling + Noise Resampling: What it is: Start with a bigger masked border early, then shrink it each time; after each blend, re-add noise and denoise to smooth transitions. How it works: Multi-step blending (early/mid/late) with progressively smaller masks; resample noise to erase hard edges. Why it matters: Without it, edges look harsh or smeared, and structure gets washed out. 🍞 Anchor: The boundary vanishes; the old and new pixels feel like one image.

03Methodology

At a high level: Sparse photos + camera poses → Coarse 3D guess (opacity + rough color) → Geometry-aware multi-view diffusion outpainting (widen FOV) → Train/refine 3D Gaussian Splatting on both original and outpainted views → Clean novel views.
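A structural sketch of that flow, with hypothetical helper names standing in for the components described below (this is not the authors’ code; it reuses the small helpers sketched in the Building Blocks section):

```python
def gamo_pipeline(images, poses, intrinsics):
    # 1) Coarse 3D guess from sparse views (DUSt3R points -> quick 3DGS fit)
    points = run_dust3r(images, poses)                  # hypothetical wrapper
    coarse_gs = fit_coarse_3dgs(points, images, poses)  # hypothetical wrapper

    # 2) Geometry-aware outpainting: same cameras, wider FOV
    outpainted = []
    for img, pose, K in zip(images, poses, intrinsics):
        K_wide = widen_intrinsics(K)                    # same pose, wider FOV
        coarse_rgb, opacity = coarse_gs.render(pose, K_wide)
        mask = build_outpaint_mask(opacity)             # where content is missing
        wide_img = outpaint_with_diffusion(img, coarse_rgb, mask, pose, K_wide)
        outpainted.append((wide_img, pose, K_wide))

    # 3) Refine 3DGS on both original photos and outpainted views
    final_gs = refine_3dgs(coarse_gs, images, outpainted)
    return final_gs
```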

Step 1: Coarse 3D Initialization

  ‱ What happens: Use DUSt3R to get a point cloud, train a quick coarse 3D Gaussian Splatting (3DGS) model, then render (a) a rough wider-FOV image and (b) an opacity map.
  ‱ Why it exists: Gives structure hints (where walls/furniture likely are) and a map of missing regions; otherwise, the diffusion might guess off-geometry.
  ‱ Example: With 6 living-room photos, the coarse render shows the couch and wall positions; the opacity map flags empty margins that need filling.

Step 2: Geometry-aware Multi-view Diffusion Outpainting 2a) Multi-view Conditioning

  ‱ What happens:
    - Camera rays: Encode each view’s rays (PlĂŒcker) so the model knows how pixels align across views.
    - Warped features: Warp input images and geometry maps to the target widened camera; keep the original center pixels crisp by inserting downscaled true content there.
    - Latent setup: Encode inputs to clean latents; initialize a noisy latent for the outpainted target. Feed all conditions to the pre-trained multi-view diffusion model (zero-shot).
  ‱ Why it exists: Locks views together in 3D and preserves trusted center pixels. Without it, outpainted edges can disagree with the middle or with other views.
  ‱ Example data: A desk corner appears in two photos; warping and rays tell the model those pixels refer to the same 3D corner when adding border content.
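As a concrete illustration of “keep the original center pixels crisp,” here is a sketch that pastes the downscaled original photo into the center of the widened conditioning image; the pinhole-geometry mapping (focal scale ≈ central crop fraction) and the tensor layout are assumptions:

```python
import torch
import torch.nn.functional as F

def insert_center_content(wide_cond: torch.Tensor, original: torch.Tensor,
                          fov_scale: float = 0.6) -> torch.Tensor:
    """Place the (downscaled) trusted original image at the center of the wider-FOV
    conditioning image, so the diffusion model only has to invent the margins.

    wide_cond: (C, H, W) conditioning image for the widened view.
    original:  (C, H, W) original photo from the same camera.
    Under a pinhole model, scaling the focal length by fov_scale maps the original
    content to roughly a fov_scale-sized central region of the widened frame.
    """
    C, H, W = wide_cond.shape
    h, w = int(H * fov_scale), int(W * fov_scale)
    small = F.interpolate(original[None], size=(h, w), mode="bilinear",
                          align_corners=False)[0]
    out = wide_cond.clone()
    top, left = (H - h) // 2, (W - w) // 2
    out[:, top:top + h, left:left + w] = small
    return out
```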

2b) Mask Latent Blending (the secret sauce)

  ‱ What happens: At chosen denoising steps (early/middle/late), blend the current generated latent with the coarse-3D latent, but only where the opacity mask says we need content. Use a hard mask (crisp boundary) and first add matching noise to the coarse latent so both are at the same denoising level.
  ‱ Why it exists: If you don’t blend, new edges can drift from the coarse structure. If you blend at every step or everywhere, you over-constrain and blur details. The selected steps balance freedom and guidance.
  ‱ Example: At 70%, 50%, and 30% of the denoising schedule, the stencil lets in just enough structure from the coarse image to line up the new wall edge.
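A minimal sketch of one such blend step, assuming a diffusers-style scheduler; which pixels the hard mask selects follows the opacity-based mask described above, and the exact API details are assumptions:

```python
import torch

def blend_step(latent: torch.Tensor, coarse_latent: torch.Tensor,
               mask: torch.Tensor, scheduler, t) -> torch.Tensor:
    """Pull the partially denoised latent toward the coarse-3DGS latent inside the mask.

    latent, coarse_latent: (B, C, h, w) latents at the same spatial size.
    mask: (B, 1, h, w) binary; 1 where the coarse prior should guide the content.
    scheduler: assumed to expose an .add_noise(x0, noise, t) method (diffusers-style);
    t: the current timestep (e.g., a scalar tensor).
    """
    noise = torch.randn_like(coarse_latent)
    coarse_t = scheduler.add_noise(coarse_latent, noise, t)  # match the noise level
    return mask * coarse_t + (1.0 - mask) * latent           # hard-mask blend
```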

2c) Iterative Mask Scheduling + Noise Resampling

  ‱ What happens: Start with a larger masked border at the early step, then shrink it later. After each blend, predict a cleaner latent, re-add noise, and denoise again to smooth boundaries.
  ‱ Why it exists: Early on, the model needs freedom to imagine plausible edges; later, it must lock onto the coarse geometry. Resampling removes harsh seams. Without this, you’d see visible halos or washed-out structure.
  ‱ Example: The bookshelf’s added side looks too sharp against the old pixels; resampling smooths it so the transition is invisible.
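Two small helpers sketch the idea: eroding the mask so the guided border shrinks at later blending steps, and re-noising/denoising a few times after each blend to smooth the seam. The one-step denoise call and the erosion amount are assumptions:

```python
import torch
import torch.nn.functional as F

def shrink_mask(mask: torch.Tensor, pixels: int) -> torch.Tensor:
    """Erode a binary (B, 1, h, w) mask so the guided border gets thinner over the
    schedule (max-pooling the inverted mask acts as erosion of the foreground)."""
    if pixels <= 0:
        return mask
    k = 2 * pixels + 1
    return 1.0 - F.max_pool2d(1.0 - mask, kernel_size=k, stride=1, padding=pixels)

def resample(latent: torch.Tensor, scheduler, model, t, n: int = 3) -> torch.Tensor:
    """After a blend, re-noise and denoise a few times at the same level so the seam
    between old and new content is smoothed (model.denoise_to is hypothetical)."""
    for _ in range(n):
        noise = torch.randn_like(latent)
        latent = scheduler.add_noise(latent, noise, t)  # jump back up in noise
        latent = model.denoise_to(latent, t)            # hypothetical one-level denoise
    return latent
```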

Outcome of Step 2: Decoding the final latent gives an outpainted view for each original camera, same resolution but wider FOV.

Step 3: 3DGS Refinement with Outpainted Views

  ‱ What happens: Train the 3DGS again, now supervising with both the original photos and the newly outpainted views (alternating). Use standard photometric + SSIM losses for originals and add a perceptual (LPIPS) term for outpainted images to guide texture realism. Optionally re-initialize points using outpainted cues so Gaussians populate the new regions.
  ‱ Why it exists: The 3D model learns from richer coverage; LPIPS helps the network care about perceptual detail, not just raw pixel error. Without this, unobserved zones stay empty or look plastic.
  ‱ Example: The previously missing floor patch now has many Gaussians with the correct color/shine, and novel views render without holes.
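A sketch of the per-view supervision during refinement. The loss weights are assumptions, pytorch_msssim is one possible SSIM implementation, and lpips.LPIPS is the standard perceptual-loss package; images are assumed to be (1, C, H, W) tensors in [0, 1]:

```python
import torch
from pytorch_msssim import ssim   # one possible SSIM implementation
# lpips_fn could be lpips.LPIPS(net='vgg'), which expects inputs in [-1, 1]

def refinement_loss(render, target, is_outpainted, lpips_fn,
                    lambda_ssim=0.2, lambda_lpips=0.5):
    """Photometric L1 + SSIM for all views; add LPIPS for outpainted views so the
    textures look perceptually realistic, not just low-error."""
    l1 = (render - target).abs().mean()
    loss = (1 - lambda_ssim) * l1 + lambda_ssim * (1 - ssim(render, target, data_range=1.0))
    if is_outpainted:
        loss = loss + lambda_lpips * lpips_fn(render * 2 - 1, target * 2 - 1).mean()
    return loss
```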

Recipe Recap:

  ‱ Inputs: 3–9+ sparse views and camera intrinsics/extrinsics.
  ‱ Parameters (typical): FOV scale ~0.6; 50 denoising steps; blending at ~0.7/0.5/0.3; resampling 3× after each blend; mask from opacity threshold ~0.6.
  ‱ Outputs: One widened outpainted image per input camera; a refined 3DGS that renders sharper, hole-free novel views.
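Collected as a small config object for easy reference (the field names are ours, not the authors’; the values restate the typical settings above):

```python
from dataclasses import dataclass

@dataclass
class GaMOConfig:
    """Typical settings from the recipe recap."""
    fov_scale: float = 0.6                       # how much each view's FOV is widened
    denoise_steps: int = 50                      # diffusion sampling steps
    blend_fractions: tuple = (0.7, 0.5, 0.3)     # where in the schedule to blend
    resample_per_blend: int = 3                  # noise-resampling passes after each blend
    opacity_threshold: float = 0.6               # below this, a pixel 'needs filling'
    num_input_views: int = 6                     # 3-9+ sparse views supported
```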

What breaks without each step:

  ‱ No coarse 3D prior: Edges plausible but misplaced; ghosting increases.
  ‱ No multi-view conditioning: Each view’s border disagrees; 3D becomes noisy.
  ‱ No mask blending: Seams or drift; structure doesn’t ‘snap’ to geometry.
  ‱ No iterative scheduling/resampling: Harsh borders or mushy blur.
  ‱ No LPIPS in refinement: Textures look flat; details lost.

04Experiments & Results

The Test: The team measured how well the final 3D reconstructions render new views using standard metrics: PSNR (higher is sharper/cleaner), SSIM (higher is more structurally correct), LPIPS (lower feels more realistic to people), and FID (lower is more natural-looking overall). They tested different numbers of input views (3, 6, 9) to simulate sparser or richer data.
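For reference, PSNR is simple enough to compute directly (a minimal sketch; SSIM and LPIPS come from standard packages such as pytorch_msssim and lpips, and FID compares feature statistics of rendered vs. real image sets):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB; higher means the rendered view is
    numerically closer to the held-out ground-truth image."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```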

The Competition: GaMO was compared with strong baselines: 3DGS, FSGS, InstantSplat, Difix3D, GenFusion, and GuidedVD-3DGS. For fairness, the same initial geometry setup (DUSt3R) was used for all methods except InstantSplat.

Scoreboard with Context (6 views):

  ‱ Replica: GaMO reached PSNR 25.84, SSIM 0.877, LPIPS 0.109, FID 72.95. Compared to GuidedVD-3DGS, that’s slightly higher PSNR (+0.17 dB), clearly higher SSIM, about 25.9% lower LPIPS, and 4.3% lower FID—like getting an A when the previous best got an A-, but with much nicer-looking details.
  ‱ ScanNet++: GaMO hit PSNR 23.41, SSIM 0.835, LPIPS 0.181, FID 108.06. That’s an 11.3% LPIPS and 11.9% FID improvement over GuidedVD-3DGS—like jumping from a B to a solid A- in perceptual quality.

Speed: GaMO completes the whole pipeline in under 10 minutes on a single RTX 4090 for typical 6-view indoor scenes, roughly 25× faster than heavy video-diffusion pipelines that take 3+ hours. That’s the difference between a quick coffee break and a whole afternoon.

Sparser and Denser Settings (3 and 9 views):

  ‱ 3 views (hard mode): GaMO still improves structure and fills many holes, outperforming most baselines.
  ‱ 9 views (easier): GaMO further raises SSIM/PSNR and drops LPIPS; even when others improve with more views, GaMO keeps the edge.

A Surprising Finding: More invented novel views did not help. In fact, adding many diffusion-generated new poses often hurt SSIM/LPIPS due to multi-view inconsistencies. Outpainting the original views, by contrast, steadily improved both structure and perceived quality. That’s a strong signal that ‘widen where you stand’ beats ‘jump to new stands’ when inputs are sparse.

Generalization: On Mip-NeRF 360 (indoor/outdoor, big spaces), GaMO had the best average metrics among tested methods, maintaining background completeness and detail where others had holes or oversmoothing.

Ablations that Matter:

  ‱ Augmented conditioning (warped features + preserved center) reduced hallucinations in known areas.
  ‱ Mask latent blending (especially hard masks) improved alignment and edges.
  ‱ Noise resampling after blending smoothed seams and bumped PSNR/SSIM.
  ‱ In refinement, re-initializing points from outpainted views and adding perceptual loss filled holes and sharpened textures.

Takeaway: Across datasets, view counts, and tests, GaMO consistently produced cleaner, more complete, and more geometrically faithful reconstructions—faster than diffusion-heavy pipelines.

05Discussion & Limitations

Limitations:

  ‱ Unseen-forever problem: If no input camera ever saw a spot (e.g., behind a thick pillar), GaMO can’t truly recover its exact look; it can only make plausible guesses.
  ‱ View placement sensitivity: If all photos cluster on one side or camera poses are off, coverage and consistency drop.
  ‱ Reliance on coarse priors: Bad coarse geometry (e.g., from poor lighting or motion blur) can misguide outpainting.

Required Resources:

  ‱ A pre-trained multi-view diffusion model, DUSt3R for initialization, and a GPU (e.g., an RTX 4090 for the reported speeds).
  ‱ A few calibrated views (3–9+); known camera parameters work best.

When NOT to Use:

  ‱ Completely occluded targets that no photo even hints at—outpainting won’t uncover magic details.
  ‱ Highly dynamic scenes (moving people/objects) where views disagree in time; this pipeline assumes a mostly static scene.
  ‱ Very noisy or badly exposed inputs where initial geometry is unreliable.

Open Questions:

  ‱ Adaptive FOV scaling: Can the system pick how much to widen each view based on confidence and scene layout?
  ‱ Hybrid strategies: Could we combine outpainting with a few carefully chosen novel views for truly occluded areas?
  ‱ Better geometry priors: Would stronger or learned coarse priors further reduce drift and speed up refinement?
  ‱ Robustness to bad poses: Can we integrate pose correction or uncertainty modeling in the loop to handle casual captures better?

06Conclusion & Future Work

3-sentence summary: GaMO reframes sparse-view 3D reconstruction by outpainting each original photo’s edges, preserving geometry while expanding coverage. It guides a zero-shot multi-view diffusion process with a coarse 3D prior, a smart opacity mask, and stepwise blending with noise resampling to produce consistent, wide-FOV images. Training 3D Gaussian Splatting on both original and outpainted views yields cleaner, hole-free, and more faithful novel views—fast.

Main Achievement: Showing that ‘widen where you stand’ (geometry-aware outpainting) is a superior, faster, and more stable paradigm than generating many new camera views for sparse inputs.

Future Directions: Explore adaptive FOV selection, integrate pose refinement, and design hybrid schemes that add only a few geometry-aware novel views for truly occluded regions.

Why Remember This: GaMO turns a tricky 3D problem into a simpler one by extending trustable views rather than inventing risky new ones—and demonstrates that a little geometry guidance plus thoughtful blending can beat brute-force generation, both in quality and speed.

Practical Applications

  ‱ Real estate: Create fast, convincing virtual tours from a handful of smartphone photos.
  ‱ Construction and BIM: Update digital twins of rooms with minimal captures while keeping geometry accurate.
  ‱ Retail layout: Scan store sections quickly and test product placements in reliable 3D.
  ‱ AR/VR content: Build immersive, stable scenes with fewer artifacts for headsets and apps.
  ‱ Robotics: Generate consistent 3D maps from sparse images to improve navigation and planning.
  ‱ Cultural heritage: Digitize small exhibits or rooms with limited photos while preserving structure.
  ‱ Education/training: Produce quick, clean 3D scenes for simulations and classroom demos.
  ‱ Film/game previsualization: Block out sets with limited on-set photos and get stable previews.
  ‱ Facility management: Rapidly document interiors for maintenance planning with minimal shots.
#3D reconstruction#outpainting#multi-view diffusion#3D Gaussian Splatting#geometry-aware generation#sparse-view#mask latent blending#iterative mask scheduling#PlĂŒcker ray embeddings#DUSt3R initialization#LPIPS#SSIM#PSNR#ScanNet++#Replica dataset