3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
Key Summary
- 3D-RE-GEN turns a single photo of a room into a full 3D scene with separate, textured objects and a usable background.
- It uses a clever "app-style" prompt (Application-Querying) to inpaint hidden parts of objects by showing the model both the full scene and the cutout at the same time.
- A geometry model estimates the camera and a point cloud for the scene and background, so the 3D layout matches the original photo's perspective.
- Each object is generated as a textured 3D mesh and then precisely placed back into the scene with differentiable rendering and a mix of 2D and 3D losses.
- A new 4-DoF ground-alignment step locks floor objects to the floor plane, preventing floating or sinking and greatly improving physical realism.
- The method also builds a background mesh, which constrains objects during optimization and supports lighting, shadows, and physics in VFX and games.
- On synthetic benchmarks, 3D-RE-GEN beats strong baselines (MIDI, DepR) on Chamfer Distance, F-score, IoU, and Hausdorff Distance.
- Ablations show both the Application-Querying step and the 4-DoF ground constraint are critical for top performance.
- The pipeline is modular, runs on one or more GPUs, and can produce an editable 3D scene with around 10 objects in under 20 minutes on a single high-end GPU.
- Limitations include sensitivity to segmentation masks, occasional geometry-transformer uncertainty, and stochastic generative outputs.
Why This Research Matters
Turning a single image into an editable 3D scene can save artists days of manual work and open 3D creation to more people. With a real background and floor alignment, the results are not just pretty; they are usable for lighting, shadows, physics, and gameplay. VFX teams can quickly prototype scenes, add effects, and adjust props without rebuilding everything from scratch. Game studios can generate starter levels or art-blockers right from concept images and iterate faster. Architecture, AR/VR, and robotics benefit too: a grounded 3D layout helps with visualization, interaction, and simulation. Because the method is modular, it can evolve as better segmentation, inpainting, and 2D-to-3D models arrive. Overall, it brings practical, production-ready realism to single-image 3D scene creation.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you have just one snapshot of your living room and you want to walk around inside it like a video game. That sounds magical, right? But turning a flat picture into a real 3D place is tricky.
The Concept (3D Reconstruction): 3D reconstruction is turning 2D pictures into 3D models that you can orbit, light, and edit. How it works:
- Find what's in the picture (objects, walls, floor).
- Guess the cameraās spot and angle at the moment the photo was taken.
- Build shapes and textures for each object and the background.
- Place everything so it lines up with the photo. Why it matters: Without it, artists must model rooms by hand, which is slow and expensive.
Anchor: Take a photo of a bedroom; 3D reconstruction outputs a navigable 3D bedroom with a bed, lamp, and walls you can light and render.
Hook: You know how a jigsaw puzzle picture looks correct only if the pieces are placed just right? A 3D scene must also place each object correctly so it rests on the floor and doesn't float.
The Concept (Camera Pose Estimation): Camera pose estimation figures out where the camera was and where it was pointing when the photo was taken. How it works:
- Detect visual clues like edges and corners.
- Infer the camera's position and angle that would produce the image.
- Build a rough 3D point cloud that matches the photo perspective. Why it matters: Without a correct camera, objects won't align with the image; chairs might slide or scale oddly.
Anchor: If the camera is estimated too high, a table may look too short in 3D even though it looked fine in 2D.
Hook: Think of a stage play: props (objects) and the set (background) both matter. Many older methods made nice props but forgot the set.
The Concept (Background Extraction): Background extraction builds a clean 3D stage (walls, floor, ceiling) that matches the photo. How it works:
- Remove objects from the image with inpainting to get an "empty room".
- Estimate scene geometry and make a mesh from the point cloud.
- Project the empty-room image onto the mesh for textures. Why it matters: Without a background, objects float in undefined space; lighting, shadows, and physics become impossible or fake.
Anchor: Removing a couch from a living room photo, then building the room shell, lets you place a new 3D couch back in exactly the right spot with real shadows.
Hook: Have you ever tried to guess what's behind someone standing in a group photo? That's hard, even for computers.
The Concept (Occlusion Handling): Occlusion handling figures out the hidden parts of objects that are covered by others in the photo. How it works:
- Identify the visible part with a mask.
- Use the scene's style, lighting, and perspective as clues.
- Recreate the missing geometry and texture consistent with the scene. Why it matters: Without handling occlusions, objects stay incomplete or wrong-shaped when rebuilt in 3D.
Anchor: If a chair leg is hidden behind a table, occlusion handling helps imagine and rebuild that missing leg so the chair doesn't float.
Hook: Imagine asking a friend to finish your drawing, but you show them both the whole picture and the part you cut out; they'll do a better job.
The Concept (Application-Querying, A-Q): A-Q is a special "app-style" visual prompt that shows the full scene on one side and the cutout object on a white background on the other, guiding the inpainting model to complete the object correctly. How it works:
- Build a two-panel UI-like image: left = full scene with an outlined target; right = the cutout on white.
- Ask the image-editing model to fill in the right panel using clues from the left.
- Extract the completed object image for 3D generation and also an empty-room background from the scene. Why it matters: Without A-Q, inpainting often guesses wrong materials or shapes because it lacks scene context.
Anchor: With A-Q, a partly hidden chair becomes a full, front-facing chair on white, matching the room's style and lighting.
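A minimal sketch of how such a two-panel A-Q prompt could be assembled with Pillow is shown below; the file names, panel size, and red outline are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: build an Application-Querying (A-Q) style two-panel prompt image.
from PIL import Image, ImageDraw

def build_aq_prompt(scene_path, mask_path, panel_size=(512, 512)):
    scene = Image.open(scene_path).convert("RGB").resize(panel_size)
    mask = Image.open(mask_path).convert("L").resize(panel_size)

    # Left panel: the full scene with the target object outlined for context.
    left = scene.copy()
    bbox = mask.getbbox()  # bounding box of the non-zero mask region
    if bbox:
        ImageDraw.Draw(left).rectangle(bbox, outline=(255, 0, 0), width=4)

    # Right panel: the visible cutout pasted onto plain white, which the
    # inpainting model is asked to complete.
    right = Image.new("RGB", panel_size, (255, 255, 255))
    right.paste(scene, (0, 0), mask)

    # Side-by-side "app style" layout: context on the left, query on the right.
    prompt = Image.new("RGB", (2 * panel_size[0], panel_size[1]), (255, 255, 255))
    prompt.paste(left, (0, 0))
    prompt.paste(right, (panel_size[0], 0))
    return prompt

# build_aq_prompt("scene.png", "chair_mask.png").save("aq_prompt.png")
```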
The world before: Single-image 3D often focused on single objects, not full rooms. Holistic scene methods guessed everything at once but struggled with complex layouts, object scales, and fine textures. Retrieval-based methods depended on model databases, which limited variety and style matching. Even newer diffusion pipelines often missed two crucial pieces: a usable background mesh and strict physical rules (like touching the floor).
The problem: Artists need modifiable, textured meshes that are placed where they truly belong (e.g., chairs on floors, lamps on tables), plus a background for lighting and physics. Previous pipelines lacked strong scene constraints and often produced floating items, merged objects, or missing backgrounds.
Failed attempts: All-at-once methods tended to blur details and misplace items. Pipelines without global context created occlusion errors. And scene generation without a ground constraint made objects drift.
The gap: A method to combine the strengths of top models while adding context-aware object completion and a firm ground alignment, plus a real background mesh, was missing.
Real stakes: This changes how fast VFX and game artists can work. Instead of days of manual modeling, they can start from a single concept image and get an editable, production-ready 3D scene (shadows, light bounces, and physics included) in minutes.
02 Core Idea
Hook: Picture building a LEGO room from a photo. If you first cleanly rebuild each brick (object), recover the room shell (background), and then snap bricks onto the floor, your LEGO room looks real.
The Concept (Generative Framework): The key idea is a compositional generative framework that combines best-in-class models for segmentation, inpainting, camera/geometry, and 2D-to-3D asset creation, then assembles everything with a physics-aware optimizer. How it works:
- Segment objects and create clean object images and an empty-room background using A-Q inpainting.
- Estimate camera and point clouds from both original and empty-room images.
- Generate textured 3D meshes per object.
- Optimize their placement using a differentiable renderer and a new 4-DoF ground constraint. Why it matters: Without combining context and physics, scenes look wrong: objects float, sizes drift, and lighting can't be simulated correctly.
Anchor: From one living-room photo, you get a textured couch mesh, a lamp mesh, a real room shell, and both items are locked to the floor exactly where they belong.
Three analogies for the same idea:
- Theater analogy: Build the stage (background), craft each prop (objects), then tape marks on the floor so props hit their exact spots (4-DoF ground alignment).
- Puzzle analogy: Complete each piece (object with A-Q), recover the board (background), and use alignment lines so every piece snaps in perfectly (composite losses + ground plane).
- Cooking analogy: Prep ingredients separately (object meshes), set the plate (background), then plate with a template ring (4-DoF) so everything sits where it should.
Hook: You know how GPS helps you place pins exactly on a map? A 3D layout needs a reference frame too.
The Concept (Point Cloud and Ground Plane): A point cloud is a 3D sprinkle of dots that outlines the room; a ground plane is the floor that anchors objects. How it works:
- Use a geometry transformer to get point clouds and camera from both original and empty-room images.
- Fit a floor plane (with a robust method like RANSAC) to guide object placement. Why it matters: Without a reliable floor plane, chairs and tables won't sit or align correctly, breaking realism.
Anchor: Estimating the floor plane lets the method slide the couch along X/Z but keep Y locked to the floor.
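As a concrete illustration, a plain NumPy RANSAC plane fit over the scene point cloud might look like the sketch below; the iteration count and inlier threshold are assumptions, not values from the paper.

```python
# Sketch: fit a ground plane to a point cloud with RANSAC.
import numpy as np

def fit_ground_plane(points, n_iters=500, inlier_thresh=0.02, seed=0):
    """points: (N, 3) array; returns (unit normal, d) with n . x + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:            # degenerate (nearly collinear) sample
            continue
        normal /= norm
        d = -normal @ p0
        dist = np.abs(points @ normal + d)      # point-to-plane distances
        inliers = int((dist < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane

# plane_normal, plane_d = fit_ground_plane(scene_points)
```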
Hook: Think of a shopping cart that can roll forward/back and left/right, rotate around, and get bigger/smaller, but it can't lift off the ground.
The Concept (4-DoF Differentiable Optimization): The 4-DoF planar model restricts movement to floor-plane translation (x, z), yaw rotation, and uniform scale while forbidding vertical drift. How it works:
- Classify objects as floor-touching or not via mask-vs-floor IoU.
- Use 4-DoF for floor objects, 5-DoF for free-floaters.
- Minimize a composite loss (2D silhouette + 3D geometry + background bounds) with differentiable rendering to align perfectly. Why it matters: Without 4-DoF, floor objects can float, tilt, or sink, ruining realism.
Anchor: The coffee table stays glued to the floor plane while rotating to match the photo's perspective.
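A minimal PyTorch sketch of this 4-DoF parameterization is shown below: translation on the floor plane, yaw about the plane normal, and uniform scale, with the vertical direction left untouched. The class name and log-scale parameterization are my assumptions, not the paper's implementation.

```python
# Sketch: 4-DoF planar pose (tx, tz, yaw, uniform scale), no vertical drift.
import torch

class PlanarPose4DoF(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tx = torch.nn.Parameter(torch.zeros(()))     # slide along plane x
        self.tz = torch.nn.Parameter(torch.zeros(()))     # slide along plane z
        self.yaw = torch.nn.Parameter(torch.zeros(()))    # rotate about plane normal
        self.log_s = torch.nn.Parameter(torch.zeros(()))  # uniform scale (log-space)

    def forward(self, verts):
        """verts: (V, 3) in floor-plane coordinates (y = plane normal, base at y = 0)."""
        s = self.log_s.exp()
        c, sn = torch.cos(self.yaw), torch.sin(self.yaw)
        zero, one = torch.zeros_like(c), torch.ones_like(c)
        rot = torch.stack([
            torch.stack([c, zero, sn]),
            torch.stack([zero, one, zero]),
            torch.stack([-sn, zero, c]),
        ])
        moved = s * (verts @ rot.T)                       # yaw + uniform scale
        moved = moved + torch.stack([self.tx, zero, self.tz])  # slide on the plane only
        return moved                                      # the base stays on the floor

# pose = PlanarPose4DoF()
# opt = torch.optim.Adam(pose.parameters(), lr=1e-2)
# loss = composite_loss(render(pose(verts)), ...); loss.backward(); opt.step()
```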
Before vs. after: Before, single-image scene generation often skipped the background and allowed unrealistic object placement. After, 3D-RE-GEN delivers a complete background and locks layouts to physics (floor plane), which dramatically improves consistency and editability.
Why it works (intuition):
- Context fixes occlusions: A-Q shows the model the scene and the incomplete object together, so the model completes the object in the right style, material, and lighting.
- Geometry guides layout: Dual-image (original + empty-room) geometry yields better camera and scene point clouds for more accurate placement.
- Physics keeps it real: 4-DoF enforces floor contact, removing common floating/sinking artifacts.
- Multi-view-in-spirit: Even though it's a single image, using both the original and empty-room images acts like a pseudo multi-view for better alignment.
Building blocks:
- Grounded SAM for segmentation and clean masks.
- A-Q for context-aware inpainting of objects and the background.
- Geometry transformer (e.g., VGGT) for camera and point clouds.
- 2D-to-3D generator for textured object meshes.
- Differentiable rendering with composite losses for precise placement.
- 4-DoF planar model (and 5-DoF when needed) to respect physics.
03 Methodology
At a high level: Single image → Segmentation and A-Q inpainting → Camera and point clouds + background mesh → 2D-to-3D object meshes → Differentiable placement with 4-DoF/5-DoF → Full 3D scene.
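To make the flow concrete, the sketch below strings the stages together in Python; every function name is a hypothetical placeholder for the corresponding stage, not an API from the paper's code.

```python
# Sketch: end-to-end pipeline with placeholder stage functions.
def reconstruct_scene(image):
    masks = segment_objects(image)                       # Grounded SAM + manual fixes
    completed, empty_room = aq_inpaint(image, masks)     # Application-Querying prompts
    camera, scene_pts, bg_pts = estimate_geometry(image, empty_room)  # geometry transformer
    background = mesh_and_texture(bg_pts, empty_room, camera)
    meshes = [generate_mesh(obj) for obj in completed]   # 2D-to-3D per object
    floor = fit_ground_plane(scene_pts)
    placed = [place_object(m, camera, scene_pts, background, floor, mask)
              for m, mask in zip(meshes, masks)]         # 4-DoF / 5-DoF optimization
    return background, placed
```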
Step 1: Find and mask objects. Hook: You know how highlighting with a marker helps you focus on key words? We do that for objects.
The Concept (Instance Segmentation): Instance segmentation draws precise outlines (masks) around each object. What happens: Grounded SAM proposes masks per object. A lightweight UI lets a user fix mistakes. Why it exists: Bad masks lead to bad assets and bad placement: silhouettes won't match and geometry will misalign. Example: A red couch gets a clean binary mask so only couch pixels are kept for the object image.
Anchor: With a correct chair mask, the pipeline learns the true chair outline for later 2D silhouette matching.
Step 2: Fill in hidden parts and build an empty room. Hook: Imagine repairing a torn sticker by looking at the poster it came from.
The Concept (Application-Querying, A-Q): A UI-style two-panel image shows (left) the full scene with the object outlined and (right) the cutout object on white to be completed. What happens: The image editor fills the right panel using lighting and style clues from the left; it also produces an empty-room background by removing all objects. Why it exists: Without scene context, inpainting guesses shapes/materials poorly. With A-Q, it reconstructs plausible, scene-consistent details. Example: A lamp partly hidden behind a plant is completed in the right panel with matching metal and shade texture; the background becomes a clean room shell.
Anchor: The completed lamp-on-white is perfect input for 2D-to-3D meshing, and the empty room becomes the stage for all objects.
Step 3: Recover camera and scene geometry; build background. Hook: If you can place your camera back where it was, it's much easier to put everything where it belongs.
The Concept (Camera and Point Clouds): A geometry transformer ingests both the original photo and the empty-room image to output cameras and aligned point clouds. What happens: Keep the original camera and the scene point cloud; also align background points. Convert the background point cloud into a triangle mesh and texture it with the empty-room image (camera projection). Why it exists: Without a well-aligned camera and background, objects can't be placed with realistic contact and perspective. Example: The room's floor/wall/ceiling mesh receives a baked texture that looks consistent under the original viewpoint.
Anchor: The background mesh acts like a physical boundary for objects and supports shadows and physics later.
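For the texturing step, a minimal NumPy sketch of projecting background-mesh vertices through the recovered pinhole camera to get per-vertex texture coordinates might look like this; how the intrinsics K and world-to-camera pose (R, t) are packaged is an assumption about the geometry model's outputs.

```python
# Sketch: bake the empty-room image onto the background mesh via camera projection.
import numpy as np

def project_to_uv(vertices, K, R, t, image_wh):
    """vertices: (V, 3) world points -> (V, 2) texture coordinates in [0, 1]."""
    cam = vertices @ R.T + t              # world -> camera coordinates
    pix = cam @ K.T                       # pinhole projection
    pix = pix[:, :2] / pix[:, 2:3]        # perspective divide -> pixel coordinates
    w, h = image_wh
    uv = np.stack([pix[:, 0] / w, 1.0 - pix[:, 1] / h], axis=1)
    return np.clip(uv, 0.0, 1.0)          # per-vertex UVs into the empty-room texture
```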
Step 4: Generate each object as a textured 3D mesh. Hook: Think of making a clay model from a clear front-view photo.
The Concept (2D-to-3D Object Generation): A generative model turns each completed object-on-white image into a textured 3D mesh. What happens: The model captures shape, materials, and surface details as a modifiable mesh (not just a fuzzy blob). Why it exists: Artists need meshes they can edit, rig, render, and simulate. Example: The completed couch image becomes a mesh with fabric normals and seams.
Anchor: Now you have separate, game/VFX-ready asset files for couch, table, lamp, etc.
Step 5: Place objects with differentiable rendering and composite losses. Hook: Like adjusting a picture frame until its shadow outline perfectly matches the mark on the wall.
The Concept (Differentiable Rendering with Composite Loss): Rendering is made differentiable so we can nudge object pose and scale to reduce errors measured by losses that compare 2D and 3D signals. What happens:
- 2D Silhouette Loss: Ensures the rendered object outline matches the ground-truth mask (robust edges via Dice + Focal).
- 3D Geometric Loss: Brings the object surface close to the target object's point cloud (point-to-mesh distance).
- Background Bounding Box Loss: Keeps objects within realistic room bounds on X/Z so they don't penetrate walls. Why it exists: Without these losses, objects won't align from all angles or could slide into walls. Example: The rendered chair silhouette tightens to match its mask while its 3D surface converges to the stenciled point cloud.
Anchor: After optimization, the chair both looks right in the photo view and sits properly in 3D space.
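The sketch below illustrates how such a composite loss could be wired up in PyTorch; the loss weights and the one-way chamfer stand-in for the point-to-mesh term are my simplifications, not the paper's exact formulation.

```python
# Sketch: composite loss = silhouette (Dice + focal) + geometry + background bounds.
import torch
import torch.nn.functional as F

def silhouette_loss(pred, target, gamma=2.0, eps=1e-6):
    """pred, target: (H, W) soft/binary silhouettes in [0, 1]."""
    pred = pred.clamp(eps, 1.0 - eps)
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = target * pred + (1.0 - target) * (1.0 - pred)
    focal = ((1.0 - p_t) ** gamma * bce).mean()
    return dice + focal

def geometric_loss(verts, target_pts):
    """One-way chamfer: each target point to its nearest mesh vertex."""
    d = torch.cdist(target_pts, verts)           # (P, V) pairwise distances
    return d.min(dim=1).values.mean()

def bounds_loss(verts, xz_min, xz_max):
    """Penalize vertices that leave the background's x/z extent."""
    xz = verts[:, [0, 2]]
    return (F.relu(xz_min - xz) + F.relu(xz - xz_max)).mean()

def composite_loss(pred_sil, gt_mask, verts, target_pts, xz_min, xz_max,
                   w_sil=1.0, w_geo=1.0, w_bg=0.1):
    return (w_sil * silhouette_loss(pred_sil, gt_mask)
            + w_geo * geometric_loss(verts, target_pts)
            + w_bg * bounds_loss(verts, xz_min, xz_max))
```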
Step 6: Enforce floor realism with 4-DoF for ground objects. Hook: Shopping cart rules: slide left/right, forward/back; rotate; scale, but don't fly.
The Concept (4-DoF Planar Model vs. 5-DoF):
- 5-DoF model: for non-floor objects (translation, yaw, scale).
- 4-DoF planar model: for floor objects; motion restricted to the ground plane with yaw and uniform scale. What happens:
- Check overlap between object mask and floor mask.
- If touching floor, project bottom vertices to the floor plane and optimize in its local coordinates (tx', tz', ry', s) before transforming back to world. Why it exists: Without the 4-DoF constraint, tables float or tilt subtly, breaking realism and metrics. Example: The coffee table is initialized on the plane and then slides/rotates only along that plane during optimization.
Anchor: The final layout looks physically believable: feet touch the floor; nothing hovers or sinks.
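A minimal sketch of the floor-contact decision and initialization described above is given below; the overlap threshold is an assumption, and the simple overlap ratio stands in for the paper's mask-vs-floor IoU test.

```python
# Sketch: decide 4-DoF vs. 5-DoF via mask/floor overlap, then snap floor objects
# onto the fitted ground plane before optimization.
import numpy as np

def touches_floor(object_mask, floor_mask, thresh=0.05):
    """Binary masks (H, W); True if enough of the object overlaps the floor region."""
    overlap = np.logical_and(object_mask, floor_mask).sum()
    return overlap / max(object_mask.sum(), 1) > thresh

def snap_to_plane(verts, normal, d):
    """Translate verts along the plane normal so the lowest point lies on the plane."""
    signed = verts @ normal + d            # signed distance of every vertex
    return verts - signed.min() * normal   # push the bottom onto the plane
```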
Secret sauce:
- A-Q inpainting feeds the 2D-to-3D model with clean, style-matched object views.
- Dual-image geometry (original + empty-room) makes camera and background tighter.
- 4-DoF locks the most common objects (floor ones) to realistic motions, massively stabilizing optimization.
Concrete mini example:
- Input: A living room photo with a couch (touching floor) and a wall-hanging picture (not touching floor).
- Masks: Clean couch and picture masks.
- A-Q: Completed couch-on-white; empty-room background.
- Geometry: Camera + point clouds aligned; background meshed and textured.
- 2D-to-3D: Couch mesh and picture mesh generated.
- Placement: Couch uses 4-DoF (slides/rotates on floor); picture uses 5-DoF (free translation/yaw/scale). Losses pull both into alignment.
- Result: A cohesive, editable scene ready for lighting and physics.
04 Experiments & Results
Hook: Imagine a school race where each runner not only has to be fast but also stay inside perfect lanes. That's like 3D scene generation: accuracy and realism both count.
The Concept (Evaluation Setup): The team tested on diverse synthetic scenes (CGTrader) and real photos, comparing to strong baselines (MIDI, DepR) using scene-level 3D metrics. How it works:
- Convert both predicted and ground-truth scenes to point clouds, normalize to unit scale, and align with ICP.
- Measure Chamfer Distance (CD), F-score, Bounding Box IoU (BBOX-IOU), and Hausdorff Distance. Why it matters: Without fair, aligned comparisons, scores can mislead and not reflect true scene quality.
Anchor: Lower CD/Hausdorff and higher F-score/IoU mean cleaner shapes, fewer errors, and better object placements.
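A minimal NumPy/SciPy sketch of the normalize-then-align step is shown below: each cloud is centered and scaled to unit size, then a basic point-to-point ICP refines the alignment. This is a simplified stand-in for the paper's setup; the scaling convention and iteration count are assumptions.

```python
# Sketch: normalize both scene clouds and align them with a simple ICP.
import numpy as np
from scipy.spatial import cKDTree

def normalize(points):
    centered = points - points.mean(axis=0)
    return centered / max(np.linalg.norm(centered, axis=1).max(), 1e-8)

def icp(source, target, n_iters=30):
    """Rigidly align source to target (both (N, 3)); returns the aligned source."""
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(n_iters):
        matched = target[tree.query(src)[1]]          # nearest-neighbor correspondences
        mu_s, mu_t = src.mean(axis=0), matched.mean(axis=0)
        u, _, vt = np.linalg.svd((src - mu_s).T @ (matched - mu_t))  # Kabsch
        if np.linalg.det((u @ vt).T) < 0:             # avoid reflections
            vt[-1] *= -1
        rot = (u @ vt).T
        src = (src - mu_s) @ rot.T + mu_t
    return src
```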
Hook: You know how spelling and grammar both matter in writing? In 3D, both shape quality and layout accuracy matter.
The Concept (Chamfer Distance, F-score, IoU, Hausdorff):
- What they are:
- Chamfer Distance: How close two point sets are overall (lower is better).
- F-score: Balance of precision and recall for surface points (higher is better).
- BBOX-IoU: Overlap of predicted vs. true object boxes (higher is better), reflecting placement/scale.
- Hausdorff Distance: The worst-case point error (lower is better), catching big outliers. How they work: After ICP alignment, compute these distances/overlaps on the whole scene. Why they matter: A scene might look okay on average (CD) but still have glaring outliers (Hausdorff) or misplacements (IoU).
Anchor: A low Hausdorff score is like saying, "No big embarrassing spikes or floating blobs in the scene."
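With the clouds aligned, the point-based metrics reduce to nearest-neighbor queries, as in the sketch below; the F-score distance threshold is an assumption, and BBOX-IoU is omitted because it needs per-object boxes.

```python
# Sketch: Chamfer Distance, F-score, and Hausdorff Distance on aligned clouds.
import numpy as np
from scipy.spatial import cKDTree

def scene_metrics(pred, gt, tau=0.05):
    """pred, gt: (N, 3) and (M, 3) normalized, ICP-aligned point clouds."""
    d_pred_to_gt = cKDTree(gt).query(pred)[0]    # nearest-neighbor distances
    d_gt_to_pred = cKDTree(pred).query(gt)[0]

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()        # lower is better
    hausdorff = max(d_pred_to_gt.max(), d_gt_to_pred.max())    # worst-case error

    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)  # higher is better
    return {"CD": chamfer, "F-score": fscore, "Hausdorff": hausdorff}
```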
The scoreboard (contextualized):
- 3D-RE-GEN achieves CD 0.011 (best), F-score 0.85 (like an A when others score Bs), IoU 0.63 (highest, meaning better placement/size), and Hausdorff 0.33 (lowest, fewer ugly outliers). DepR and MIDI trail across these.
- Qualitatively, 3D-RE-GEN scenes show sharp, separate objects, correct floor contact, and a textured background. MIDI sometimes merges or duplicates geometry; DepR can produce blob-like, misaligned shapes.
Surprising findings:
- Background matters a lot: With a real background mesh, objects land and light more realistically, improving both numbers and looks.
- A-Q dramatically helps occlusions: Without it, completed objects often inherit wrong materials or shapes, tanking both 2D and 3D scores.
- The method generalizes even to some outdoor cases: While organic assets (trees/foliage) are hard, structured items (cars) align well thanks to the ground-plane logic.
Ablations (what breaks without key parts):
- No 4-DoF: 3D metrics drop (e.g., IoU and Hausdorff worsen) and objects tend to float or misalign with perspective.
- No A-Q: Both 2D (SSIM, LPIPS) and 3D metrics fall, backgrounds don't reconstruct well, and occluded assets become incomplete or wrong.
Runtime and practicality:
- Around 17–20 minutes for ~10 objects on a single high-end GPU; 7–8 minutes on 4 GPUs. Most time goes to 2D-to-3D asset creation.
- Modular design supports swapping lighter 2D-to-3D models to reduce memory/compute when needed.
05 Discussion & Limitations
Limitations (honest look):
- Mask sensitivity: If initial masks are sloppy, inpainting completes the wrong shapes and silhouette losses guide placements poorly.
- Geometry uncertainty: Low-confidence points get dropped, which can leave holes in the background mesh or cause slight camera misalignments.
- Optimization traps: Differentiable placement can get stuck in local minima (e.g., 180° flips look similar in silhouette), yielding suboptimal poses.
- Object granularity: Complex combos (like a shelf with books) are generated as a single mesh, limiting fine-grained editing unless manually separated.
- Generative stochasticity: Random seeds in inpainting and 2D-to-3D generation can produce slightly different results each run, which hurts strict reproducibility.
Required resources:
- A reasonably powerful GPU (or multiple) speeds things up; the heaviest part is 2D-to-3D generation.
- A quick human pass to fix tricky masks is often worth it for quality.
When not to use:
- If you need deterministic, bit-for-bit repeatability (e.g., strict versioning), the stochastic steps can be an issue.
- Scenes dominated by highly organic assets (dense foliage) may underperform until 2D-to-3D models trained on such data improve.
- If you already have multi-view data, a dedicated multi-view pipeline might beat single-image assumptions.
Open questions:
- Can hierarchical constraints extend beyond the floor (e.g., auto-placing cups on tables, lamps on sideboards) using multiple planar anchors?
- How can we better escape optimization local minima, e.g., with multi-hypothesis starts, learned priors, or extra cues like depth?
- Can background reconstruction incorporate material/BRDF estimation robustly enough to outperform baked textures in diverse lighting?
- How to make outputs more reproducible without sacrificing generative quality, e.g., via seed management, latent caching, or guided diffusion controls?
06 Conclusion & Future Work
3-sentence summary: 3D-RE-GEN turns a single image into a complete, editable 3D scene by combining context-aware inpainting (A-Q), accurate camera/geometry recovery, and physics-aware object placement. A new 4-DoF ground constraint locks floor objects to the floor plane, while a reconstructed background mesh adds realism and constrains optimization. The result is state-of-the-art scene quality with separate, textured assets and a usable environment for VFX and games.
Main achievement: Showing that a modular, compositional pipeline (powered by A-Q inpainting, dual-image geometry estimation, and 4-DoF planar optimization) produces physically plausible, artist-ready 3D scenes from a single photo.
Future directions: Add hierarchical constraints (e.g., table-top placement), extend to multi-view inputs, and integrate robust material/BRDF estimation. Improve reliability under heavy occlusions and organic shapes by training 2D-to-3D models on broader data and exploring multi-hypothesis optimization to dodge local minima.
Why remember this: It's a practical leap for creators: minutes instead of days from concept art to editable 3D. By respecting scene context and simple physics (the floor!), 3D-RE-GEN makes single-image scene reconstruction not just pretty, but production-ready.
Practical Applications
- Rapid VFX previsualization: turn a reference photo into a 3D stage with editable props for quick shot planning.
- Game art blocking: generate starter environments from concept images and iterate placements and lighting.
- Set extension: rebuild photographed rooms and extend them digitally with consistent perspective and lighting.
- Virtual staging for real estate: remove furniture, then place 3D assets on the recovered floor for new looks.
- AR interior design: try couches, tables, and lamps in a photographed room with proper floor contact.
- Education and training: teach scene layout, lighting, and physics using single images turned into 3D labs.
- Robotics simulation: create approximate indoor layouts from photos to test navigation and manipulation.
- Cinematic relighting: use the background mesh to cast realistic shadows and bounce light from added CG props.
- Asset library growth: inpaint occluded items to create clean images, then generate new 3D meshes for reuse.
- Forensics/analysis: reconstruct approximate room geometry from a single image for spatial reasoning.