
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Intermediate
Elisabetta Fedele, Francis Engelmann, Ian Huang et al. Ā· 12/5/2025
arXiv Ā· PDF

Key Summary

  • SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.
  • You can feed in rough blocks (like superquadrics) or full meshes plus a short text, and get a detailed, textured 3D asset out.
  • A single control knob (tau) lets you slide between 'perfectly follows my shape' and 'looks extra realistic.'
  • It plugs into modern 3D generators (like Trellis) at test time by editing their hidden (latent) space.
  • Compared to training-based and optimization-heavy methods, SpaceControl is faster to use and more faithful to your input geometry.
  • User studies and numbers show it better matches the intended shape while keeping high visual quality.
  • You can also add an image to guide the style and textures while the geometry follows your shapes.
  • An interactive UI lets you edit superquadrics live and instantly turn them into textured 3D assets.

Why This Research Matters

SpaceControl gives creators a direct, fast way to turn rough 3D sketches into finished, textured assets, reducing iteration time dramatically. Designers can lock in exact dimensions while still getting natural, believable details, improving both productivity and quality. Game and AR/VR teams can fill scenes with assets that fit perfectly, rather than endlessly tweaking results from vague prompts. Educators and hobbyists can learn 3D by shaping simple primitives and watching them blossom into realistic models. Because it needs no retraining, studios can deploy it immediately with existing pre-trained backbones. The tau control makes the art–engineering trade-off explicit and easy to dial in. Overall, it closes the gap between precise intent and generative creativity in day-to-day workflows.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re building a Lego city. You have a pile of amazing bricks (the AI’s skills), but you can only shout vague instructions like ā€œMake a cool house!ā€ You’ll probably get something neat, but not exactly the shape you pictured.

🄬 The Concept: 3D generative modeling is when computers automatically create 3D objects. How it works (big picture):

  1. A model learns patterns of geometry and texture from lots of examples.
  2. At generation time, it uses noise and guidance (like text) to build a new shape.
  3. It outputs meshes, radiance fields, or 3D Gaussians.

Why it matters: Without it, making 3D assets is slow and expensive; with it, we can fill games, AR/VR, and design tools with content quickly.

šŸž Anchor: Think of typing ā€œa wooden chairā€ and the computer producing a 3D chair you can spin around on screen.

šŸž Hook: You know how a friend who’s great at drawing still needs your sketch to get your exact idea? Just saying ā€œdraw a fancy chairā€ isn’t precise enough.

🄬 The Concept: Pre-trained generative models are expert ā€œartistsā€ that already learned from huge datasets. How it works: 1) Train once on tons of shapes. 2) Freeze the weights. 3) Reuse the model for many tasks. Why it matters: You don’t want to retrain for every new control or category; pre-trained models generalize well and are efficient to reuse.

šŸž Anchor: Like a chef who’s cooked thousands of dishes—give them a hint, and they’ll whip up something great without relearning how to cook.

The world before SpaceControl: Most 3D generators listened to text prompts or to images. Text is flexible but fuzzy about geometry (ā€œtall backrestā€ means what, exactly?). Images can show shape, but they are awkward to edit and make it hard to pin down exact poses or proportions. Artists lacked a simple way to say, ā€œMake it exactly this tall and this wide, with armrests here.ā€

šŸž Hook: Picture telling a robot to place furniture: ā€œPut the couch somewhere comfyā€ vs. ā€œPut it exactly 2 meters long, with two pillows at the corners.ā€ The second is crisp and spatial.

🄬 The Concept: Controllability means steering both what the object is and its precise geometry. How it works: 1) Give the model a control signal (text, image, or shape). 2) The model fuses this signal with its knowledge to generate the asset. 3) A good system lets you vary the strength of the control. Why it matters: Without solid control, you spend time re-generating and editing; workflows slow down.

šŸž Anchor: A designer needs a chair with a 45 cm seat height to fit a table set. Hitting that dimension first try saves hours.

What people tried:

  • Training-based control (e.g., fine-tuning for voxels or primitives): Powerful but costly, can overfit or reduce generalization, and often needs category-specific training.
  • Guidance-based optimization (project shapes into 2D views and optimize): Training-free but slow and indirect, as you control images of the object rather than the object in 3D itself.
  • Detail-enrichment tools: Great if you already have a detailed mesh, but many creators start with rough blocks.

šŸž Hook: It’s like wanting to mold clay directly with your hands, but someone keeps asking you to describe the clay in a poem, or to trace it from photos instead.

🄬 The Concept: The gap was a fast, training-free way to directly inject 3D geometry into a modern 3D generator. How it works: 1) Accept simple or detailed 3D shapes as input. 2) Guide the model’s hidden space during generation. 3) Let users dial how strongly it follows the input. Why it matters: This gives everyday creators immediate, precise spatial control without months of training a new model.

šŸž Anchor: A game artist blocks out a sofa with a few rounded boxes, types ā€œmodern linen sofa,ā€ slides a knob for fidelity, and instantly gets a textured asset that fits the room’s dimensions.

Real stakes in daily life:

  • Game teams can rapidly populate levels with objects that fit exact spaces.
  • AR apps can preview furniture that matches room dimensions precisely.
  • Product designers can iterate on forms quickly, from rough mockups to polished assets.
  • Educators and hobbyists can turn simple blocks into believable 3D creations, learning by doing.

šŸž Hook: Think of turning a quick doodle into a gallery-ready painting—just by telling the computer ā€œkeep this shape, now make it look real.ā€

🄬 The Concept: SpaceControl fills the missing piece: test-time, geometry-aware control that plugs into strong pre-trained 3D generators. How it works: It encodes your 3D shape into the model’s latent space, starts the denoising process from a point that ā€œremembersā€ your geometry, and lets you balance shape faithfulness vs. realism. Why it matters: It preserves the model’s broad skills while giving you a direct handle on geometry.

šŸž Anchor: Now, the Lego city gets built exactly on your blueprint—fast, faithful, and beautiful.

02 Core Idea

šŸž Hook: You know how tracing paper lets you lay a sketch over a final drawing to keep the outlines perfect while adding details? That’s the spirit here.

🄬 The Concept: The key insight is to inject your 3D geometry directly into the model’s hidden (latent) space at test time, and start denoising from there. How it works:

  1. Encode your control geometry into the same latent space the model uses for structure.
  2. Pick a time t_start that blends your geometry with noise (the tau knob).
  3. Run the model’s normal denoising to a clean structure.
  4. Generate appearance (texture/material) with text or an optional image.

Why it matters: This needs no retraining, preserves the model’s general knowledge, and gives a smooth slider between faithfulness and realism.

šŸž Anchor: It’s like baking with a mold: pour batter (noise+control) into a mold (your shape), bake (denoise), then decorate (texture) to taste.

Three analogies to cement the idea:

  1. GPS Waypoints: Your geometry is a waypoint list for the route (denoising). A higher tau means the car starts closer to the waypoints, keeping the path tight. Lower tau starts farther away, allowing scenic realism detours.
  2. Cake Mold vs. Freehand: With a firm mold (high tau), the cake matches your form exactly. With a softer hint (low tau), the cake may puff into a more natural—but less exact—shape.
  3. Tracing Then Coloring: First trace the lines you want (structure stage), then color and shade (appearance stage) using a text or image style.

šŸž Hook: Imagine if changing just one dial on your music app could move from ā€œexact cover songā€ to ā€œcreative remix.ā€

🄬 The Concept: The tau control parameter sets how strongly the model obeys your geometry. How it works: 1) The model mixes your geometry latent with random noise. 2) Higher tau biases toward your geometry; lower tau lets the model improvise more. 3) You can try a few tau values to get your perfect balance. Why it matters: Without tau, you can’t flexibly trade exact shape vs. life-like realism in seconds.

šŸž Anchor: For a sofa: tau high to nail measurements; tau lower to get fluffier cushions that look extra natural.

šŸž Hook: Think of editing a sculpture’s outline first, then adding paint and patterns.

🄬 The Concept: Trellis separates structure generation from appearance generation; SpaceControl taps both. How it works: 1) Structure: use your geometry to set the structure latent starting point and denoise to a crisp occupancy grid. 2) Appearance: sample features at active voxels and denoise with text or an optional image to get textures/materials. Why it matters: Splitting structure and appearance makes geometry control clean and lets style be adjusted independently.

šŸž Anchor: A ā€œwooden chairā€ keeps its proportions from your blocks (structure), while image conditioning adds ā€œfloral upholsteryā€ (appearance).

Why it works (intuition, no equations):

  • Rectified flows move latents from noise to clean samples along smooth velocity fields. If you start the trip closer to your target geometry (via the encoded control), the path naturally preserves those spatial cues (see the small sampler sketch after this list).
  • Because the base model already learned a broad distribution of plausible shapes, the final result stays realistic even as we bias structure toward your control.
  • Separating structure from appearance avoids entangling pose and textures, so one can be strict (shape) while the other stays flexible (style).
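To make the first bullet concrete, here is a hedged sketch of a rectified-flow sampler finishing the trip from that blended starting point. `velocity_model` is a stand-in for the pre-trained Structure Flow Model, and the Euler update and sign convention are illustrative assumptions.

```python
import numpy as np

def velocity_model(x_t, t, text_features):
    """Stand-in for the pre-trained Structure Flow Model (returns zeros so the
    sketch runs; a real model predicts the velocity toward a clean latent)."""
    return np.zeros_like(x_t)

def denoise_structure(x_start, t_start, text_features, num_steps=10):
    """Euler-integrate the flow from t_start down to 0.

    Because x_start already encodes the control geometry, the remaining (short)
    trajectory tends to keep those spatial cues while adding learned realism.
    """
    x_t = x_start
    ts = np.linspace(t_start, 0.0, num_steps + 1)
    for t_curr, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_model(x_t, t_curr, text_features)
        x_t = x_t + (t_next - t_curr) * v   # small step along the learned velocity field
    return x_t

# Usage with a toy latent and dummy text features.
x0 = denoise_structure(np.zeros((16, 16, 16, 8)), t_start=0.2, text_features=None)
print(x0.shape)
```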

Before vs. after SpaceControl:

  • Before: Vague text-only control, indirect image control, slow optimization, and retraining for new inputs.
  • After: Direct 3D control at test time, no training, quick iterations, adjustable strength, and consistent high-quality results.

šŸž Hook: You know how LEGO instructions first show the frame (structure) and then the colorful bricks (appearance)?

🄬 The Concept: Building blocks of SpaceControl are: (a) geometry encoder, (b) latent starting point with tau, (c) structure denoising, (d) appearance denoising with text/image, (e) decoders to meshes, radiance fields, or 3D Gaussians. How it works: Each step keeps your geometry signal alive while letting the model add learned realism. Why it matters: A modular recipe means it can plug into other two-stage 3D generators too.

šŸž Anchor: Swap the final decoder to get the asset format you need: mesh for games, radiance fields for neural rendering, or Gaussians for fast previews.

03 Methodology

High-level recipe: Input (geometry + optional text/image) → Encode into latents → Mix with noise at chosen start time (tau) → Denoise structure → Decode occupancy grid → Sample appearance latents at active voxels → Denoise appearance with text/image → Decode to your favorite 3D format.
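Here is the same recipe written as a compact orchestrator. Every name in it (voxelize, structure_encoder, and so on) is a hypothetical placeholder for the corresponding pre-trained Trellis component, wired up as no-op stubs so the control flow actually runs; it sketches the order of operations, not the real API.

```python
import numpy as np

# Hypothetical stand-ins for the pre-trained components (no-op stubs so this runs).
voxelize          = lambda shape3d, res=64: np.zeros((res, res, res), dtype=bool)
structure_encoder = lambda occ: np.zeros((16, 16, 16, 8))
structure_flow    = lambda x, t_start, text: x              # denoises the structure latent
structure_decoder = lambda z: np.zeros((64, 64, 64), dtype=bool)
appearance_flow   = lambda feats, text, image: feats        # denoises per-voxel features
decode_asset      = lambda occ, feats, fmt: {"format": fmt}

def spacecontrol_generate(control_shape, text, image=None, tau=6, num_steps=10, fmt="mesh"):
    rng = np.random.default_rng(0)
    # Stage 1: structure, started from the encoded control geometry (not pure noise).
    z_geom  = structure_encoder(voxelize(control_shape))
    t_start = 1.0 - tau / num_steps
    x_start = (1 - t_start) * z_geom + t_start * rng.standard_normal(z_geom.shape)
    occ     = structure_decoder(structure_flow(x_start, t_start, text))
    # Stage 2: appearance, sampled only at the active voxels of the generated structure.
    active  = np.argwhere(occ)
    feats   = rng.standard_normal((max(len(active), 1), 8))
    feats   = appearance_flow(feats, text, image)
    return decode_asset(occ, feats, fmt)

asset = spacecontrol_generate(control_shape=None, text="modern linen sofa", tau=6)
print(asset)
```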

šŸž Hook: Imagine building a snowman: roll the big shapes first (structure), then add scarf and buttons (appearance).

🄬 The Concept: Stage 1—Structure generation with spatial control. How it works (step-by-step):

  1. Prepare the geometry control: You can use superquadrics (simple parametric shapes) or a full mesh. Scale/align to the model’s canonical space and voxelize to a 64Ɨ64Ɨ64 occupancy grid.
  2. Encode geometry: Feed the binary grid into Trellis’s pre-trained structure encoder to get a compact latent volume (16Ɨ16Ɨ16Ɨ8). This puts your geometry in the same space the generator understands.
  3. Choose tau (control strength): This sets the time t_start at which we initialize the latent with a mix of noise and your encoded geometry. Lower tau = closer to noise; higher tau = closer to your geometry.
  4. Initialize latents: Combine random noise with your encoded geometry using the model’s forward process at t_start.
  5. Denoise with the pre-trained Structure Flow Model: From t_start down to zero, apply the rectified flow steps; use text features (CLIP) to guide semantics (e.g., ā€œwooden chair with armrestsā€).
  6. Decode structure: Use the pre-trained decoder to turn the clean structure latent into a binary occupancy grid (the scaffold of your 3D object). Why it matters: Starting from the geometry latent keeps your spatial intent intact; without this, shape drifts toward whatever the model finds likely, not what you need.

šŸž Anchor: Feed in a few rounded blocks for a sofa and the text ā€œmodern linen sofaā€ā€”the output structure keeps your lengths and cushion layout.

šŸž Hook: Think of stage 2 as choosing fabrics and paint for the sculpture you just shaped.

🄬 The Concept: Stage 2—Appearance generation with text and/or image. How it works (step-by-step):

  1. Identify active voxels: From the occupancy grid, collect the coordinates inside the object (L active voxels).
  2. Sample appearance latents: For each active voxel, sample a small feature vector from noise.
  3. Add conditioning:
   • Text: Encode with CLIP to get a semantic style (e.g., ā€œfloral fabric with wooden legsā€).
   • Optional image: Encode with DINOv2 to capture texture/color/style from a reference photo.
  4. Denoise appearance: Run the pre-trained Appearance Flow Model to turn noisy features into clean appearance features aligned with your text/image.
  5. Decode output: Choose a format (meshes for games/engines, radiance fields for view synthesis, or 3D Gaussians for fast previews). Why it matters: Separating structure and appearance lets you strictly control shape while flexibly styling the look. Without it, changing fabric might accidentally warp the chair.

šŸž Anchor: Keep the same sofa proportions while swapping between ā€œvelvet tealā€ and a photo sample of ā€œcream linen.ā€

šŸž Hook: You know the volume knob on a speaker? Tau is the volume knob for how loudly your shape speaks during generation.

🄬 The Concept: Tau—the realism vs. faithfulness dial. How it works:

  1. Tau determines t_start (where we begin denoising).
  2. High tau = start near your geometry latent: excellent adherence to shape, sometimes slightly less natural micro-details.
  3. Low tau = start near noise: more model freedom, often extra-realistic flourishes but looser adherence. Why it matters: One dial gives artists instant creative control per asset, no retraining or complex settings.

šŸž Anchor: For a product prototype, turn tau up to hit exact dimensions; for a concept art pass, turn it down to get pleasant, realistic variations.

Concrete mini-examples (data-level):

  • Input: Two superquadrics shaped like a seat and backrest + text ā€œA wooden chair with a floral cushion.ā€ Tau=6. Process: Encoder → latent mix near geometry → structure denoise → decode occupancy → sample appearance latents → appearance denoise with text + optional floral image → mesh decode. Output: Chair matching seat/back dimensions, with floral texture.

  • Input: Mesh of a toy whale + text ā€œCartoon blue whaleā€ and an image of a pattern. Tau=4. Process: Mesh→voxelize→encode→latent mix a bit closer to noise→structure denoise→decode→appearance denoise with both text and image. Output: Whale that keeps the broad shape but gains cute, slightly stylized features and the reference pattern.

Secret sauce (why this is clever):

  • It reuses the pre-trained encoder/decoder and flow models—no new training, no architecture changes.
  • It operates directly in 3D latent space for structure, avoiding slow multi-view optimization.
  • The tau dial exposes a simple, continuous control over a complicated trade-off artists constantly face.
  • Multi-modal conditioning (text + optional image) keeps geometry-true results while enabling rich, consistent textures.

What breaks without each step:

  • No voxelization/encoding: the model doesn’t ā€œspeakā€ your geometry’s language and can’t follow it closely.
  • No tau control: you’re stuck with either too rigid or too loose generations.
  • No structure/appearance split: texture edits can bend geometry or shapes can wash out textures.
  • No image option: hard to match a brand/style board precisely.

04 Experiments & Results

šŸž Hook: Think of a science fair where each robot must build objects from a sketch—judges score accuracy to the sketch and how good the result looks.

🄬 The Concept: The tests measure two things: shape faithfulness to the input geometry and visual realism/quality. How it works:

  • Datasets: Chairs and tables from ShapeNet (categories some baselines were trained on) and Toys4K (unseen open-world toys).
  • Spatial inputs: Coarse primitives (superquadrics) and full meshes.
  • Baselines: Spice-E (trained on categories), Spice-E-T (training-based adaptation on Trellis), and Coin3D (multi-stage guidance via images).
  • Metrics:
   • Chamfer Distance (CD): lower means a better shape match.
   • CLIP-I: higher means better text-image alignment.
   • FID (on rendered images) and P-FID (on point-cloud geometry features): lower means more realistic.
   Why it matters: These cover both ā€œdoes it match my shape?ā€ and ā€œdoes it look good?ā€

šŸž Anchor: It’s like checking that your cake matches the pan shape (CD) and still tastes gourmet (FID/P-FID).

Scoreboard with context (at a representative tau, e.g., 6):

  • On Toys4K with primitive inputs, SpaceControl achieves CDā‰ˆ14.0 (much lower/better than training-based and Coin3D), while keeping CLIP-I and FID/P-FID comparable to strong baselines. That’s like getting an A+ in matching the sketch while also keeping an A in presentation.
  • On mesh inputs, SpaceControl’s CD drops further (ā‰ˆ4.89), showing very tight adherence when you supply more detailed geometry, with competitive realism scores.
  • On chairs/tables (where some baselines had training advantages), SpaceControl still leads or stays highly competitive in CD, while keeping realism solid—showing that test-time control doesn’t sacrifice visual quality.

User study (52 participants, ~20 samples each):

  • Participants compared pairs on faithfulness, realism, and overall preference.
  • SpaceControl was preferred overall and for shape faithfulness against training-based baselines; realism remained competitive. Interpretation: Even untrained eyes noticed that outputs better matched the intended geometry without looking fake.

šŸž Hook: Like a mixing desk in a music studio, turning one knob visibly changes the song’s feel.

🄬 The Concept: Tau analysis—quantifying the realism vs. faithfulness dial. How it works:

  • Sweep tau across values and record CD (shape match) and FID (realism).
  • Observed trend: Increasing tau reduces CD (better match) but can slightly raise FID (a small realism cost). Why it matters: This confirms the control is smooth and predictable, letting users pick their sweet spot—often tau in [4,6] balances both nicely.

šŸž Anchor: For a precise product mockup, users picked higher tau; for concept art, slightly lower tau delivered extra natural flourishes.

Surprising findings:

  • Training-free control in 3D latent space can beat training-based methods on geometric faithfulness, even on categories they were tuned for.
  • Optional image guidance primarily influences appearance as expected, but with SpaceControl it becomes a practical, easy style-transfer tool for 3D, keeping geometry locked.

Qualitative highlights:

  • Non-axis-aligned inputs: SpaceControl aligns perfectly; others may drift.
  • Complex poses and unusual forms: SpaceControl holds geometry better, avoiding odd artifacts (e.g., duplicate heads or misplaced parts).
  • Editing sequences: With an image for style, you can make iterative geometry edits while preserving a consistent look.

05 Discussion & Limitations

šŸž Hook: Even the best remote-control car has limits on range and battery—knowing them helps you drive smarter.

🄬 The Concept: Limitations and practical considerations. How it works / Why it matters:

  • Manual tau selection: Users pick tau by eye; this is flexible but not fully automatic. For batch or fully automated pipelines, per-instance tuning can be a hurdle.
  • Uniform adherence: One tau applies to the whole object. Some tasks want strict backs and flexible seats (part-aware control).
  • Dependency on a two-stage model: SpaceControl assumes a structure-first, appearance-second workflow (like Trellis/SAM 3D). Porting to single-stage models would need adaptation.
  • Input voxelization quality: Very thin parts or extremely high-res details may need careful scaling or higher resolution to encode faithfully.
  • Semantic ambiguity without text/image: Geometry alone can be class-agnostic; short prompts help disambiguate (chair vs. stool).

šŸž Anchor: If you want a chair with a rock-solid frame but playful upholstery, today you tweak tau and prompts globally; tomorrow’s version could let you tag the seat only.

Required resources:

  • A GPU for inference (to run structure/appearance flows and decoders).
  • The pre-trained base model (e.g., Trellis) and its encoders/decoders.
  • A simple editor or pipeline to provide superquadrics or meshes and to voxelize them.

When not to use:

  • If your use case demands fully automatic generation across thousands of items with no human in the loop and no room for per-item tau adjustments.
  • If you need CAD-grade, watertight, and parameterized engineering models with guaranteed tolerances; post-processing or specialized CAD tools may be better.
  • If your base model is not two-stage or lacks encoders compatible with your geometry inputs without additional work.

Open questions:

  • Auto-tau selection: Can we predict an ideal tau based on geometry complexity or prompt semantics?
  • Part-aware control: Can we assign different adherence per region (e.g., armrests strict, cushions flexible)?
  • Better geometric inputs: Beyond superquadrics and meshes, can skeletons or semantic parts improve control?
  • Wider base-model support: How easily can this generalize to other 3D backbones and higher resolutions?
  • Interactive feedback loops: Can the model suggest edits to primitives to reach a desired realism-faithfulness target faster?

06 Conclusion & Future Work

Three-sentence summary: SpaceControl adds a simple, powerful way to steer 3D generation by directly injecting your geometry into a pre-trained model’s latent space at test time. With one tau knob, you control how closely the asset follows your shape versus how much natural realism the model adds, all without retraining. It works across simple primitives or detailed meshes and keeps textures on track with text or optional image guidance.

Main achievement: Demonstrating that training-free, direct 3D latent guidance can outperform training-based and optimization-heavy baselines on geometric faithfulness while preserving high visual quality.

Future directions: Automating tau selection, enabling part-aware/local control, supporting more input types (skeletons, landmarks), scaling to even higher-resolution structure, and broadening compatibility with more 3D backbones.

Why remember this: SpaceControl turns quick 3D sketches into production-ready assets, closing the gap between an artist’s intent (exact dimensions and layout) and a model’s creativity (realistic details), all with a single, intuitive control.

Practical Applications

  • Rapid prototyping: Block out a product with superquadrics and instantly generate a realistic, dimension-accurate 3D asset.
  • Level design: Sketch furniture or props to exact sizes for a game scene and generate matching assets on demand.
  • AR staging: Create furniture that fits a real room’s measurements for virtual try-outs.
  • Style-consistent edits: Keep a reference image to maintain brand materials while altering geometry (e.g., same fabric across different chair shapes).
  • Catalog variation: Produce multiple variants of a base form by sweeping tau for controlled diversity.
  • Education: Teach 3D concepts by starting from simple shapes and observing how structure and appearance interact.
  • Industrial visualization: Align generated assets to CAD-like silhouettes for early concept reviews.
  • Scene blocking: Roughly lay out a full scene with primitives and convert all items to meshes automatically.
  • Character props: Precisely place handles, straps, or attachments via primitives, then texture with an image style.
  • A/B testing: Quickly compare strict vs. loose adherence (different tau) to choose the best balance for production.
#3D generative modeling Ā· #test-time guidance Ā· #latent space intervention Ā· #rectified flows Ā· #Trellis Ā· #superquadrics Ā· #voxelization Ā· #Chamfer Distance Ā· #FID Ā· #P-FID Ā· #CLIP Ā· #DINOv2 Ā· #structure-appearance disentanglement Ā· #mesh generation Ā· #spatial control