
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Intermediate
Yawar Siddiqui, Duncan Frost, Samir Aroudj et al. · 1/16/2026
arXiv · PDF

Key Summary

  • ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.
  • It uses a rectified flow transformer to denoise a special 3D code (VecSet latent) that a decoder turns into a full 3D mesh.
  • Unlike many past methods, ShapeR does not need perfect cut-out masks and stays robust under clutter, occlusions, blur, and odd viewpoints.
  • SLAM points give ShapeR metric scale, so objects come out the right real-world size and can be put back into the scene correctly.
  • Training relies on heavy, on-the-fly augmentations and a two-stage curriculum: first on diverse single objects, then on realistic synthetic scenes.
  • During inference, a detector finds objects, 3D points are projected to 2D to softly tell the model what to focus on, and a VLM provides a helpful caption.
  • On a new in-the-wild benchmark with 178 objects, ShapeR beats strong baselines, improving Chamfer distance by about 2.7× and winning most user preferences.
  • Ablations show SLAM points, augmentations, two-stage training, and 2D point-mask prompting are all crucial for robustness.
  • The method outputs metrically consistent object meshes that can be composed into full scenes automatically.
  • Code, weights, and the evaluation dataset will be released, pushing toward unifying generative models with metric scene reconstruction.

Why This Research Matters

ShapeR turns the kind of casual videos people already take into accurate, correctly sized 3D objects. That means AR apps can place furniture and tools at the right scale without special scanning, and robots can plan grasps and paths using reliable geometry. Creators can build VR/AR assets quickly from real spaces instead of modeling by hand. Retailers and warehouses can inventory items from quick walk-throughs, even with clutter and occlusions. Insurers and real-estate pros can document spaces with structured, metric 3D, not just photos. Because ShapeR avoids fragile segmentations and handles messy inputs, it’s a practical step toward everyday 3D understanding. Releasing code and a realistic benchmark should also speed up progress across the community.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re trying to build a LEGO model of a messy bedroom just from a quick, wobbly video on your phone. Some toys are behind others, lights are dim, and your hand shakes. That’s hard, right?

🥬 The Concept (3D Shape Generation): 3D shape generation is when a computer turns pictures or videos into full 3D objects you can spin around. How it works:

  1. Look at images of an object from different views
  2. Guess a 3D shape that explains those views
  3. Improve the guess until it matches the images well

Why it matters: Without this, apps and robots can’t understand or interact with the real world in 3D.

🍞 Anchor: Think of a camera slowly circling a chair and the computer building a 3D model of that chair.

🍞 Hook: You know how a class photo gets messy if kids block each other and the picture is blurry?

🥬 The Concept (Casual Captures): Casual captures are regular, non-scanning videos where people move freely, so views are imperfect, cluttered, and shaky. How it works (reality, not a method): People film naturally; objects get occluded; backgrounds are busy; frames are low-res or blurry. Why it matters: Many AI methods expect clean, centered, well-lit objects, so they fail on everyday videos. 🍞 Anchor: A quick kitchen tour video where the toaster is half hidden behind a bowl—exactly the kind of input the model must handle.

🍞 Hook: You know how a ruler helps you draw lines at the right length?

🥬 The Concept (Metric Scale): Metric scale means a 3D model keeps real-world sizes (a mug stays mug-sized). How it works:

  1. Use camera motion and SLAM to recover scale
  2. Reconstruct objects in a normalized space
  3. Rescale meshes back to the real scene units

Why it matters: Without metric scale, a mug might come out the size of a couch.

🍞 Anchor: When you place the reconstructed lamp back on your desk in AR, it fits exactly where it belongs.

🍞 Hook: Imagine trying to draw a toy car when someone only gives you one fuzzy photo.

🥬 The Problem: Past 3D methods did great on clean, cut-out objects, but stumbled on real videos with clutter, occlusions, and bad angles. How it shows up:

  1. Scene-centric methods make one big mesh and miss parts hidden by other objects
  2. Single-image generative methods look nice but aren’t the right size (non-metric)
  3. Methods that need perfect masks break when masks are noisy

Why it matters: Real homes, offices, and stores are messy. We need methods that work there.

🍞 Anchor: A living room video where a sofa is partly behind a table—older methods leave the hidden sofa side empty or the scale wrong.

🍞 Hook: You know how detectives solve cases better when they combine footprints, camera footage, and witness notes?

🥬 The Gap: There wasn’t a system that used many complementary clues—images, 3D points, and words—to make robust, metric 3D objects from casual videos. How it works (what’s missing):

  1. No reliable metric grounding from just pretty images
  2. No training that mirrors real-world messiness
  3. No way to focus on the right object without perfect masks

Why it matters: Without these, you get fragile reconstructions that fall apart in daily life.

🍞 Anchor: A cup on a shelf gets reconstructed wrongly if the model sees only pictures and no 3D point hints.

🍞 Hook: Think of practicing math: you start easy and then tackle harder problems.

🥬 The Concept (Curriculum Learning): Train models first on simpler, clean data, then on harder, realistic data so they become robust. How it works:

  1. Stage 1: massive single-object data with strong augmentations
  2. Stage 2: fewer categories but realistic scenes with occlusion and SLAM noise
  3. The model learns both variety and realism

Why it matters: Without a curriculum, models either overfit the toy world or can’t handle the messy real one.

🍞 Anchor: First learning to ride on a quiet sidewalk, then handling busy streets.

The world before ShapeR: scene-centric methods gave incomplete objects under occlusion; single-view 3D generation looked good but wasn’t metric; many pipelines needed perfect 2D masks. People tried adding more views or better segmentations, but casual captures still broke them.

What ShapeR brings: a multimodal, object-centric approach that fuses images, sparse SLAM points, and captions; a flow-based generator that keeps metric scale; and a training plan that injects real-world messiness through on-the-fly augmentations and a two-stage curriculum.

Why you should care: This lets AR place furniture at the right size, helps robots grasp the correct handle, assists home inventory without pro scans, and makes 3D content creation faster—all from the kind of videos people already take.

02Core Idea

🍞 Hook: You know how solving a puzzle gets easier if you have the picture on the box, a few key edge pieces, and a note telling you it’s a beach scene?

🥬 The Aha! in one sentence: Combine three complementary clues—posed images, sparse SLAM points with real scale, and a short caption—to guide a rectified flow model that denoises a 3D latent and outputs a clean, metrically accurate object mesh, even from messy videos.

Multiple Analogies:

  1. Detective team: photos (witness videos), SLAM points (footprints with true distances), caption (suspect description). Together, the case (shape) is solved.
  2. Connect-the-dots + coloring book: SLAM points are the dots (structure), images are the colors/shading (appearance cues), caption hints at the object type. Follow a smart recipe (rectified flow) to complete the picture in 3D.
  3. Orchestra: strings (images), brass (points), lyrics (caption). The conductor (flow transformer) blends them into a harmonious 3D shape.

🍞 Anchor: A toaster on a cluttered counter—images show shiny edges, SLAM points trace the boxy geometry, caption says “small silver toaster.” The model outputs a correctly sized toaster cube with slots on top.

🍞 Hook: You know how you can start with a scribble and gradually clean it up into a neat drawing?

🥬 The Concept (Rectified Flow): Rectified flow teaches a model to push a noisy guess steadily toward the correct 3D code. How it works:

  1. Start with random noise in a 3D latent space
  2. At each tiny step, predict the velocity pointing toward the true latent
  3. Integrate these steps from t=1 to t=0 to land on the right code

Why it matters: It’s stable and learns clear “paths” from noise to shape, handling complex objects well.

🍞 Anchor: Like sliding a clay blob into a vase shape with gentle, well-aimed pushes.
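
To make the denoising idea concrete, here is a minimal rectified-flow sketch in the spirit of the description above. It is illustrative only: `velocity_net` stands in for ShapeR’s conditioned transformer, and the latent shapes, step count, and schedule are assumptions rather than the paper’s settings.

```python
# Minimal rectified-flow sketch (illustrative, not the paper's code).
# `velocity_net` is any model mapping (noisy latent, t, conditioning) -> predicted velocity.
import torch

def rf_training_loss(velocity_net, x0, cond):
    """x0: clean VecSet-style latent, shape (B, N, D); cond: conditioning tokens."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1)        # random time in (0, 1)
    x1 = torch.randn_like(x0)                                 # pure-noise endpoint (t = 1)
    xt = (1 - t) * x0 + t * x1                                # straight-line interpolation
    target_v = x1 - x0                                        # constant velocity along that line
    pred_v = velocity_net(xt, t.view(b), cond)
    return torch.mean((pred_v - target_v) ** 2)

@torch.no_grad()
def rf_sample(velocity_net, cond, shape, steps=50, device="cpu"):
    """Euler-integrate the learned velocity field from t = 1 (noise) to t = 0 (clean latent)."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        v = velocity_net(x, t, cond)
        x = x - dt * v                                        # step toward t = 0
    return x
```

Because the training target is a straight line from noise to data, the learned velocity field tends to be smooth, which is why a small number of Euler steps already gives clean latents.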

🍞 Hook: Imagine secret blueprints encoded as a set of LEGO instructions.

🥬 The Concept (VecSet 3D VAE): A 3D VAE turns meshes into compact latent “sets of vectors” and back into distance fields. How it works:

  1. Encoder turns surface and edge points into a compact variable-length latent (VecSet)
  2. Decoder turns the latent into an SDF (signed distance function)
  3. Marching Cubes extracts the final mesh from the SDF

Why it matters: The latent is flexible and efficient, perfect for a flow model to denoise.

🍞 Anchor: The model stores a chair as a short instruction list, then rebuilds the chair surface precisely.
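
A rough picture of what a VecSet-style autoencoder interface could look like, assuming the encoder turns sampled surface points into a fixed-size set of latent vectors and the decoder answers SDF queries against that set. Class and method names here are invented for illustration; the released model will differ in architecture and scale.

```python
# Illustrative VecSet-style shape VAE interface (names and sizes are assumptions).
import torch
import torch.nn as nn

class VecSetAutoencoderSketch(nn.Module):
    def __init__(self, num_latents=512, dim=64):
        super().__init__()
        self.num_latents, self.dim = num_latents, dim
        self.point_proj = nn.Linear(3, dim)                   # lift surface samples to tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.query_proj = nn.Linear(3, dim)
        self.sdf_head = nn.Linear(dim, 1)

    def encode(self, surface_points):                         # (B, P, 3) -> (B, num_latents, dim)
        tokens = self.encoder(self.point_proj(surface_points))
        return tokens[:, : self.num_latents]                  # keep a fixed-size latent set

    def decode_sdf(self, latents, query_points):              # (B, L, dim), (B, Q, 3) -> (B, Q)
        q = self.query_proj(query_points)                     # queries attend to the latent set
        attn = torch.softmax(q @ latents.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        return self.sdf_head(attn @ latents).squeeze(-1)      # one signed distance per query
```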

Before vs After:

  • Before: Models needed clean masks or failed under occlusions; single-view methods weren’t metric; scene-centric methods missed hidden parts.
  • After: ShapeR uses object-centric crops, SLAM points for scale, and multimodal conditioning to reconstruct complete, metrically accurate shapes without explicit masks.

Why It Works (intuition):

  • Ambiguity shrinks when you mix independent clues: images say “what it looks like,” points say “where things are and how big,” captions say “what it probably is.”
  • Rectified flow learns reliable paths in latent space, so generation is stable.
  • Point projections to 2D act like soft focus cues, telling the model which pixels belong to the target object without hard segmentation.
  • Curriculum + augmentations inoculate the model against real-world mess (clutter, blur, occlusions).

Building Blocks (each with a mini-sandwich):

🍞 Hook: You know how a chef tastes and adjusts using salt, spice, and sweetness together? 🥬 Multimodal Conditioning: Fuse images, points, and text tokens inside the transformer via cross-attention so each modality informs the others. How it works: encode points (3D sparse conv), images (frozen features + camera rays), text (frozen encoders); the transformer attends across them while denoising the latent. Why it matters: One modality’s weakness is covered by another’s strength. 🍞 Anchor: Dim lighting? Points and caption still keep the shape on track.

🍞 Hook: Point to what you’re talking about when a room is crowded. 🥬 Implicit Segmentation via 2D Point Masks: Project 3D object points to each image and give their 2D footprints to the model. How it works: make binary masks from projected points; encode masks with a small CNN; concatenate with image tokens. Why it matters: The model learns “this region is my object,” without needing fragile full masks. 🍞 Anchor: The 2D dots land on the toaster area, so the model ignores the nearby blender.
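
A small sketch of how such a 2D point-hint mask could be built from an object’s sparse points and one camera, assuming a standard pinhole model with a world-to-camera pose; the paper’s exact rasterization and mask encoding may differ.

```python
# Hedged sketch of the "implicit segmentation" cue: project an object's sparse
# 3D SLAM points into a frame and rasterize them as a binary hint mask.
import numpy as np

def point_hint_mask(points_world, K, R, t, height, width, radius=2):
    """points_world: (N, 3); K: (3, 3) intrinsics; R, t: world-to-camera pose."""
    cam = points_world @ R.T + t                  # world -> camera coordinates
    cam = cam[cam[:, 2] > 1e-6]                   # keep points in front of the camera
    uv = cam @ K.T                                # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    mask = np.zeros((height, width), dtype=np.uint8)
    for u, v in uv:
        u, v = int(round(u)), int(round(v))
        if 0 <= v < height and 0 <= u < width:
            mask[max(0, v - radius): v + radius + 1,
                 max(0, u - radius): u + radius + 1] = 1      # small square "dot" per point
    return mask
```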

🍞 Hook: Practice easy puzzles first, then harder ones. 🥬 Two-Stage Curriculum: Pretrain on diverse single objects with strong augmentations, then fine-tune on realistic scene crops with occlusions and SLAM noise. How it works: stage 1 builds broad priors; stage 2 adds real-scene robustness. Why it matters: Prevents overfitting to toy cases and boosts real-world success. 🍞 Anchor: After training, a sofa partly behind a table still comes out complete and right-sized.

03Methodology

High-level recipe: Input (posed images + SLAM points + captions) → Preprocess per object → Encode conditions → Rectified flow denoising of a 3D latent → Decode to SDF → Marching Cubes → Rescale to metric → Output mesh per object (compose into scene).

Step 1: Preprocess and detect objects

  • What happens: Run visual-inertial SLAM on the video to get sparse 3D points and camera poses. Use a 3D instance detector to get object boxes. For each object, collect: its point subset, a handful of representative frames where it appears, 2D projections of its 3D points (binary masks), and a short caption from a vision-language model.
  • Why it exists: We need object-centric bundles with metric hints (points), appearance (images), and semantic nudges (caption) so the generator knows “what,” “where,” and “how big.”
  • Example: A silver toaster is detected; its 3D points are cropped; 8 frames are selected; in each frame, projected points mark where the toaster is; the caption says “small silver toaster with two slots.”

🍞 Hook: Dots on a map tell you where cities are. 🥬 Concept (Sparse SLAM Points): A few reliable 3D points summarize object geometry and carry real-world scale. How it works: track image features across frames; triangulate into 3D; associate points with frames and objects. Why it matters: They anchor the model to size, depth, and pose. 🍞 Anchor: Even if the toaster is half hidden, its point cloud still outlines a boxy volume.

Step 2: Encode each modality into tokens

  • Points: a sparse 3D ResNet turns the object’s point cloud into point tokens.
  • Images: a frozen image backbone extracts features; camera rays (Plücker-like encodings, sketched in the code after this list) are attached so features know viewing geometry; the 2D point-mask per frame is encoded and concatenated as a focus cue.
  • Text: a frozen text encoder turns the caption into tokens.
  • Why it exists: The transformer needs comparable token streams to cross-attend and share information across modalities.
  • Example: For a low-light frame, the image features are uncertain, but the point tokens stay confident; the transformer balances them.
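
As referenced in the image bullet above, here is one plausible way to compute per-pixel Plücker-style ray encodings from camera intrinsics and pose; the exact ray parameterization ShapeR uses may differ.

```python
# Sketch of per-pixel Plücker-style ray encodings (one plausible formulation).
import numpy as np

def plucker_rays(K, R, t, height, width):
    """Return an (H, W, 6) map of (direction, moment) per pixel.
    K: intrinsics; R, t: world-to-camera pose (x_cam = R x_world + t)."""
    cam_center = -R.T @ t                                     # camera origin in world coordinates
    v, u = np.mgrid[0:height, 0:width]
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=np.float64)], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T                       # back-project pixels to camera rays
    dirs_world = dirs_cam @ R                                 # rotate into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(cam_center, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)      # 6-D Plücker encoding per pixel
```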

🍞 Hook: A choir needs a conductor. 🥬 Concept (Flow-Matching Transformer): The model predicts the velocity that moves a noisy latent toward the true 3D latent, step by step. How it works: concatenate and cross-attend tokens; modulate by timestep and text; integrate predicted velocities from noise to target. Why it matters: Provides stable, efficient training for high-fidelity generation. 🍞 Anchor: With every beat, the noisy shape gets crisper until it becomes a toaster.
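
A toy version of one conditioned denoising block, showing the general pattern of timestep modulation plus attention over concatenated condition tokens. This is a simplified stand-in, not the FLUX-like dual/single-stream architecture the paper describes.

```python
# Toy conditioned denoising block: latent tokens self-attend, then cross-attend
# to image/point/text condition tokens, with simple timestep modulation.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.t_mod = nn.Linear(dim, 2 * dim)                  # timestep embedding -> (scale, shift)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, latent_tokens, cond_tokens, t_embed):
        scale, shift = self.t_mod(t_embed).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(latent_tokens) * (1 + scale) + shift   # modulate by timestep
        latent_tokens = latent_tokens + self.self_attn(h, h, h)[0]
        h = self.norm2(latent_tokens)
        latent_tokens = latent_tokens + self.cross_attn(h, cond_tokens, cond_tokens)[0]
        return latent_tokens + self.mlp(self.norm3(latent_tokens))
```

In use, `cond_tokens` would be the concatenation of the image, point, and text token streams, e.g. `torch.cat([image_tokens, point_tokens, text_tokens], dim=1)`, with the timestep embedding supplied separately.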

Step 3: Denoise the latent and decode the shape

  • What happens: Start from Gaussian noise in VecSet latent space; integrate the transformer’s velocity field; obtain a clean latent; pass it to the VAE decoder to get an SDF; run Marching Cubes to get a mesh.
  • Why it exists: VecSet latents are compact and expressive; SDFs give watertight, smooth surfaces; Marching Cubes extracts triangles reliably.
  • Example: The toaster’s flat sides and curved edges pop out clearly in the final mesh.

🍞 Hook: Dip a cookie cutter into dough to reveal the shape. 🥬 Concept (SDF + Marching Cubes): The SDF says how far a point is from the surface; the zero level is the surface. Marching Cubes finds that surface as a mesh. How it works: evaluate SDF on grid; triangulate the zero crossing. Why it matters: Produces high-quality, consistent geometry. 🍞 Anchor: The neat shell of the toaster emerges from the SDF “dough.”
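
A compact sketch of this extraction step: evaluate the decoder’s SDF on a regular grid and run Marching Cubes (here via scikit-image) on the zero level set. The resolution and bounds are placeholders, not the paper’s settings.

```python
# Sketch of mesh extraction from an SDF using Marching Cubes.
import numpy as np
from skimage import measure

def extract_mesh(sdf_fn, resolution=128, bound=1.0):
    """sdf_fn: maps (N, 3) points in the normalized cube to signed distances."""
    lin = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    sdf = sdf_fn(grid).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
    verts = verts / (resolution - 1) * 2 * bound - bound      # voxel indices -> cube coordinates
    return verts, faces, normals
```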

Step 4: Metric rescaling and scene composition

  • What happens: Objects are predicted in a normalized cube; meshes are rescaled back using the object’s point cloud, preserving size and location (a minimal rescaling sketch follows this list); place all objects back to form the scene.
  • Why it exists: Keeps real-world sizes and alignments.
  • Example: The toaster returns to its exact spot on the counter and isn’t giant.
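
A minimal rescaling sketch, assuming the object was normalized to an axis-aligned cube around its SLAM points; the paper’s normalization may use a different bounding box or handling.

```python
# Map a mesh predicted in a normalized cube back to metric world coordinates
# using the object's SLAM point cloud (axis-aligned box for simplicity).
import numpy as np

def rescale_to_metric(verts_normalized, object_points_world, bound=1.0):
    """verts_normalized: (V, 3) in [-bound, bound]^3; object_points_world: (N, 3) in meters."""
    lo, hi = object_points_world.min(0), object_points_world.max(0)
    center, extent = (lo + hi) / 2.0, hi - lo
    scale = extent.max() / (2.0 * bound)                      # one isotropic scale preserves proportions
    return verts_normalized * scale + center                  # back to metric world coordinates
```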

Training “secret sauce”

  • On-the-fly compositional augmentations: images (background compositing, occluders, blur, resolution drops, color shifts), points (dropouts, Gaussian noise, partial trajectories, occlusions). This yields endless variety and simulates casual capture messiness.
  • Two-stage curriculum: Stage 1 pretraining on 600K diverse artist-made meshes with heavy augmentations; Stage 2 fine-tuning on Aria Synthetic scene crops with realistic occlusions and SLAM noise.
  • Implicit segmentation via 2D point masks: teaches the model where to look, without brittle full-instance masks.
  • Architecture choices: a FLUX-like dual/single-stream DiT that first cross-attends text, then images/points; omit positional embeddings per prior 3D work; modulate with timestep and text.
  • Why this is clever: Each piece patches a known failure mode—SLAM handles metric scale; masks handle object focus amidst clutter; augmentations mimic real-world quirks; curriculum balances diversity and realism.

🍞 Hook: Practice juggling beanbags before flaming torches. 🥬 Concept (Augmentations): Synthetic occlusions, noise, and blur teach robustness before the model faces real chaos. How it works: random, on-the-fly mixes across modalities. Why it matters: Without them, the model cracks under casual captures. 🍞 Anchor: After training, a blurry, partial-view toaster still comes out right.
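
A toy flavor of these on-the-fly augmentations, with illustrative parameter ranges (not the paper’s): point dropout and jitter for the SLAM stream, plus crude resolution drops and blur for images.

```python
# Toy on-the-fly augmentations for points and images (illustrative ranges only).
import numpy as np

def augment_points(points, drop_prob=0.3, noise_std=0.01, rng=np.random):
    keep = rng.rand(len(points)) > drop_prob                  # simulate sparse / partial tracks
    return points[keep] + rng.randn(keep.sum(), 3) * noise_std  # simulate SLAM noise

def augment_image(img, rng=np.random):
    if rng.rand() < 0.5:                                      # random resolution drop
        h, w = img.shape[:2]
        small = img[::2, ::2]
        img = np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)[:h, :w]
    if rng.rand() < 0.5:                                      # crude box blur as a blur stand-in
        img = (img.astype(np.float32) + np.roll(img, 1, 0) + np.roll(img, 1, 1)) / 3.0
    return img
```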

Inference summary

  • Detect objects and crop their points
  • Pick N representative frames; project points to make 2D masks; make a caption
  • Encode tokens; integrate the flow from noise to latent; decode SDF; Marching Cubes
  • Rescale to metric and place back in scene
  • Typical settings: up to 16 views at 280×280; runtime ranges from a few seconds to minutes depending on hardware

04Experiments & Results

🍞 Hook: Think of a school sports day—many teams compete, and the scoreboard tells who really did best.

The Test

  • What: Robust object reconstructions from casual sequences with clutter, occlusions, and odd angles.
  • Where: A new evaluation set of 178 in-the-wild objects across 7 scenes, each with posed images, SLAM points, captions, and carefully aligned reference meshes.
  • Why: Existing datasets are either too clean (studio tabletop) or have incomplete object geometry; this benchmark stresses real-life difficulty.

The Competition

  • Posed multiview-to-3D: EFM3D, FoundationStereo fusion, DP-Recon, LIRM
  • Foundation image-to-3D (mostly single-view during testing): TripoSG, Direct3DS2, Hunyuan3D-2.0, Amodal3R
  • Scene-level methods: MIDI3D, SceneGen (both need interactive instance segmentation)

Scoreboard (with context)

  • ShapeR significantly lowers Chamfer distance (CD, sketched in the code after this list), improving roughly 2.7× over strong baselines—like getting an A+ when others get a B- or C+.
  • Normal Consistency (NC) and F1 are high as well, meaning surfaces are oriented correctly and reconstructions match reference shapes at fine thresholds.
  • User studies: across 660 preference judgments, ShapeR is chosen over the top single-view generators about 81–88% of the time, a gap people can clearly see.
  • Scene comparisons: MIDI3D/SceneGen often struggle with scale and layout; ShapeR keeps objects metrically consistent and neatly arranged automatically.
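
For reference, the headline metric works roughly like this: a symmetric Chamfer distance between point samples drawn from the predicted and reference meshes. The brute-force version below is only a sketch; the benchmark’s sampling density, thresholds, and units follow the paper.

```python
# Symmetric Chamfer distance between two point samples (brute-force sketch).
import numpy as np

def chamfer_distance(pred_pts, gt_pts):
    """pred_pts: (N, 3), gt_pts: (M, 3), both in metric units."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()        # pred->gt plus gt->pred terms
```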

Ablations (what matters most)

  • SLAM points: removing them hurts robustness, especially under weak visual cues—points give the shape a “skeleton” with real size.
  • Augmentations: turning off point or image augmentations reduces performance under noise and occlusion; background compositing and synthetic occluders are key.
  • Two-stage training: skipping realistic scene fine-tuning weakens results in clutter; curriculum training clearly helps generalization.
  • 2D point-mask prompting: without it, the model sometimes reconstructs a neighboring object; with it, the model focuses on the right target.

Surprising Findings

  • Single-view foundation models did better than their multi-view versions in casual conditions when human-selected clean views and perfect masks were provided—yet ShapeR still wins even without such hand-holding and remains metric.
  • ShapeR can be adapted to single-image metric 3D by adding an external monocular metric-estimator (e.g., MapAnything), suggesting broad applicability.

🍞 Anchor: In a crowded kitchen, baselines miss hidden sides or resize items oddly; ShapeR reconstructs the toaster fully and places it exactly where it belongs.

05Discussion & Limitations

Limitations (be specific)

  • Few or low-quality views: If images are very blurry or the object appears only briefly, fine details can be lost or surfaces can be incomplete.
  • Stacked/attached items: Tight contact (e.g., books on a shelf, objects touching) can cause the mesh to include bits of neighbors.
  • Dependency on detection: If the 3D detector misses an object or boxes it poorly, reconstruction will miss or be degraded—downstream can’t fix upstream misses.
  • Thin, transparent, or reflective objects: Sparse points and image cues can be unreliable on glass, shiny metal, or very thin structures.

Required Resources

  • Training: hundreds of GPUs (as reported) for the VAE and the rectified-flow transformer; large-scale object and scene datasets; careful augmentation pipeline.
  • Inference: SLAM/SfM for points and poses, a 3D detector, and a VLM for captions; typically a GPU for reasonable runtime.

When NOT to Use

  • Real-time robotics with strict millisecond-level latency budgets, where offline SLAM and detection are not feasible.
  • Dynamic scenes with moving objects during capture; the static assumption may break.
  • Scenarios demanding textured outputs or material properties right away—ShapeR focuses on geometry.

Open Questions

  • End-to-end detection + reconstruction: Can we jointly train to reduce dependency on separate detectors and reduce error propagation?
  • Texture/material generation: How to add photoreal textures and PBR materials while keeping metric geometry?
  • Outdoor and very large scenes: How to scale to city blocks or forests, and handle different SLAM regimes?
  • Efficiency: Can rectified flow be distilled or pruned for edge devices without losing robustness?
  • Self-supervision from casual videos: Can we learn more directly from unlabeled in-the-wild sequences to reduce data curation costs?

06Conclusion & Future Work

Three-sentence summary: ShapeR fuses images, SLAM points, and captions inside a rectified-flow transformer that denoises a 3D latent to produce metrically accurate, complete object meshes from messy, casual videos. Robustness comes from heavy on-the-fly augmentations and a two-stage curriculum that first learns broad priors, then adapts to realistic scene conditions. On a new in-the-wild benchmark, ShapeR substantially outperforms strong baselines while avoiding fragile 2D segmentations.

Main achievement: Showing that multimodal conditioning plus rectified flow on VecSet latents can robustly deliver object-centric, metric 3D reconstructions under real-world casual captures—closing the gap between pretty generative shapes and practical, scene-consistent geometry.

Future directions: Integrate detection and reconstruction end-to-end; extend to textured, relightable assets; push toward single-image metric 3D through monocular metric estimators; broaden to outdoor/large-scale scenes; and shrink compute for mobile devices. Expanding and standardizing in-the-wild benchmarks will help the field measure real progress.

Why remember this: ShapeR is a concrete step toward unifying generative 3D modeling with metric scene reconstruction, turning everyday videos into accurate 3D building blocks for AR, robotics, and content creation—without needing studio captures or perfect masks.

Practical Applications

  • AR interior design: place and resize furniture accurately in your home from a casual room video.
  • Robotics grasping: reconstruct metric shapes of tools and objects for reliable pick-and-place.
  • Warehouse inventory: scan aisles casually to build 3D catalogs of items with correct sizes.
  • Insurance and claims: document damaged rooms with object-level, metric reconstructions.
  • Real-estate walkthroughs: convert tours into object-centric, metrically consistent 3D listings.
  • Game/VR asset creation: quickly capture real props and convert them into clean 3D meshes.
  • Museum digitization: record exhibits during normal tours and reconstruct objects robustly.
  • Education and labs: build accurate 3D models of classroom setups or experiments from videos.
  • Retail try-before-you-buy: reconstruct products on shelves to preview them at home in AR.
  • Home organization: auto-map and measure household items to plan storage or renovations.
#ShapeR #3D reconstruction #object-centric #rectified flow #flow matching #VecSet #SLAM points #multimodal conditioning #metric scale #SDF #Marching Cubes #curriculum learning #augmentations #pose estimation #vision-language captions