
UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Intermediate
Tanghui Jia, Dongyu Yan, Dehao Hao et al. Ā· 12/24/2025
arXiv Ā· PDF

Key Summary

  • UltraShape 1.0 is a two-step 3D generator that first makes a simple overall shape and then zooms in to add tiny details.
  • It fixes messy training data with a new watertight processing method that seals holes and thickens too-thin parts while keeping sharp details.
  • The system separates 'where things are' from 'what they look like' so the model can focus on fine details without getting lost.
  • A coarse vector-set model builds the big shape, and a voxel-based refiner—guided by precise positions (RoPE)—adds crisp geometry.
  • Their data pipeline filters out bad models, normalizes poses, and uses a VLM plus a VAE test to keep only high-quality shapes.
  • The refiner decodes to an SDF grid and extracts the final surface with marching cubes for clean, smooth meshes.
  • Even with limited public data and modest GPUs, UltraShape matches or beats many open-source methods on geometric quality.
  • It scales at test time: feeding more tokens produces noticeably sharper, richer geometry without retraining.
  • You can even do training-free stylization: one image guides the coarse shape while another image sculpts fine style in the second stage.

Why This Research Matters

High-fidelity 3D assets are the building blocks of movies, games, AR shopping, and robot training. UltraShape 1.0 shows how to make these assets reliably and at scale using only public data and reasonable compute. Cleaner watertight meshes mean fewer surprises during rendering, 3D printing, or physics simulation. The coarse-to-fine strategy speeds creative workflows: block out a design fast, then add crisp detail where it counts. Test-time scaling lets teams dial quality up without retraining, saving time and money. Training-free stylization hints at flexible pipelines where structure and style can be mixed and matched on demand.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine building a LEGO castle. First, you snap together big bricks to outline the walls and towers. Later, you come back to add flags, windows, and tiny decorations. Easy when you do it in steps, right?

🄬 The Concept (3D Diffusion Framework): It's a way for computers to create 3D shapes step by step, starting from noise and improving them little by little. How it works: (1) Start with a rough guess, (2) repeatedly remove noise and fix the shape, (3) use learned patterns from many examples, (4) end with a believable 3D object. Why it matters: Without step-by-step improvement, the model would try to jump straight to perfection and usually fail—like trying to finish a LEGO castle in one handful of blocks.

šŸž Anchor: When you ask for a 3D chair, diffusion helps the model first get the seat-and-legs idea and then refine the legs to be sturdy and the backrest to be smooth.

The world before: 2D image generators were already great at making pictures, but 3D is harder. Good 3D data is rarer, comes in many messy formats, and needs to be watertight (no holes) to be truly useful for rendering, physics, or manufacturing. Another hurdle: 3D gets expensive fast because the space grows in three directions—double the resolution and memory can blow up to eight times more.

🄬 The Concept (Data Processing Pipeline): This is the cleanup crew for shapes before the model learns from them. How it works: (1) Gather raw 3D objects, (2) repair geometry and remove noise, (3) filter out low-quality or misaligned models, (4) store clean, standardized shapes for training. Why it matters: If the ingredients are bad, the cake won’t rise. Bad training data teaches the model bad habits.

šŸž Anchor: Like washing, peeling, and slicing veggies before cooking, a good data pipeline makes the training go smoothly and safely.

The problem: Existing watertight repairs were unreliable. Some methods estimate distances but mess up inside/outside signs, creating double walls. Others shoot rays to decide what’s inside, but get noisy in tight spots. Flood-fill methods can leak through cracks, turning shapes into thin shells.

🄬 The Concept (Watertight Processing Method): This is a new, sparse-voxel approach to seal holes and resolve inside/outside cleanly before making a surface. How it works: (1) Convert meshes into a sparse 3D grid, (2) automatically find and close holes, (3) detect open surfaces and thicken them so they’re solid, (4) reconstruct with signed distances to extract a clean manifold surface. Why it matters: Without watertightness, volumes are ambiguous, physics breaks, and later stages (like extracting surfaces) become unreliable.

šŸž Anchor: Think of patching a leaky water bottle until it can hold water again—now you can measure how full it is and carry it around without spills.

Failed attempts: Global vector-set models are compact but often miss fine details (surfaces look over-smoothed). Sparse voxel methods capture details better but demand many tokens and strain memory. Early coarse-to-fine ideas helped, but they still blended ā€œwhere things areā€ with ā€œwhat they look like,ā€ making learning unstable.

🄬 The Concept (Variational Autoencoder, VAE): A VAE learns a compact code for a shape and can reconstruct it. How it works: (1) Encoder turns a 3D shape into a small set of numbers (latents), (2) decoder turns latents back into a shape, (3) training teaches the pair to compress and decompress well. Why it matters: Without a good VAE, you can’t store or refine shapes efficiently.

šŸž Anchor: It’s like zipping a file and unzipping it later—quick to store, quick to rebuild.

The gap: We needed a pipeline that (1) guarantees watertight, clean training shapes, (2) uses a two-step generator that first locks down the big structure, then adds details, and (3) explicitly separates ā€œpositionā€ from ā€œdetailā€ so the model stops tripping over itself.

Real stakes: High-quality 3D assets power movies, AR try-ons, robot simulation, 3D printing, and video games. Cleaner, scalable generation means faster design cycles, cheaper content creation, safer robot training, and more believable virtual worlds.

02 Core Idea

šŸž Hook: You know how you first hang a map on the wall (deciding where each city goes), and only afterward start drawing roads, rivers, and buildings? Planning the places first makes the drawing part much easier.

🄬 The Aha! Concept (Coarse-to-Fine Generation Pipeline): First nail the big shape, then add tiny details—while keeping positions fixed so the details don’t drift. How it works: (1) Stage 1 builds a coarse, globally correct shape using a compact vector-set model, (2) Stage 2 refines local geometry using voxel-based queries tied to precise coordinates, (3) positions are encoded with RoPE so the refiner knows exactly where it’s working, (4) the decoder outputs an SDF grid and marching cubes extracts the final, detailed mesh. Why it matters: Without fixing positions first, the model must invent both the layout and the details at once—hard and unstable.

šŸž Anchor: Like sketching a stick figure (coarse) and then drawing muscles, hair, and clothing (fine) exactly where the stick figure says.

Three analogies for the same idea:

  • Blueprint then bricks: Draft a house plan, then lay bricks exactly on the lines so windows end up where the plan says.
  • Coloring book: The outlines (coarse) come first, then you color inside the lines (fine details) without going outside.
  • GPS for sculpture: Place location pins around a statue-to-be; now you only need to chisel details at those pinned spots.

🄬 The Concept (Voxel-based Refinement): Voxels are tiny 3D cubes—like 3D pixels—that mark fixed spots in space where the model adds details. How it works: (1) Sample voxel queries near the coarse surface, (2) give each query a precise position, (3) let the diffusion refiner predict local shape features at those spots, (4) combine all predictions into a smooth field (SDF) and extract the surface. Why it matters: Without fixed voxel spots, the model has to guess both positions and details, making it easy to blur or misplace fine structures.

šŸž Anchor: It’s like decorating cupcakes on a muffin tray—each cupcake (voxel location) has a fixed place, so sprinkles and frosting stay organized.

🄬 The Concept (RoPE – Rotary Positional Encoding): RoPE tells the model the exact 3D coordinates of each voxel query in a way the transformer understands. How it works: (1) Convert XYZ positions into special angle-based signals, (2) mix them into attention layers, (3) preserve relative positions even when scaling, (4) help the model match details to the right spots. Why it matters: Without good position signals, details could float to the wrong place.

šŸž Anchor: Like GPS coordinates for every tiny workshop on a city map so deliveries (detail updates) always arrive at the right address.

Before vs after:

  • Before: Vector-only models gave smooth but bland geometry; voxel-only models were detailed but heavy and hard to scale.
  • After: Start with a compact global shape, then refine at anchored voxel spots—sharper details, better stability, less confusion for the model.

Why it works (intuition): The search space shrinks. By fixing ā€œwhere,ā€ the model only solves ā€œwhat,ā€ which is easier. Clean, watertight training data lowers noise. Voxel anchors plus RoPE make attention focus locally while still respecting the whole shape. An SDF decoder and marching cubes give clean, manifold surfaces.

Building blocks (with mini-concepts):

  • 🄬 Concept (Diffusion Transformer, DiT): A transformer that denoises shapes over time. How: self-attention over tokens and time steps; Why: learns complex dependencies smoothly. šŸž Anchor: Like a team of editors polishing a draft paragraph by paragraph.
  • 🄬 Concept (Tokens): Small packets of information representing parts of a shape or image. How: group features into manageable chunks; Why: lets transformers attend and combine info efficiently. šŸž Anchor: Lego pieces—you can build big things by snapping many small, standard blocks.
  • 🄬 Concept (Signed Distance Field, SDF): A function that tells how far a point is from the surface, with a sign for inside/outside. How: negative inside, positive outside, zero on the surface; Why: makes surfaces smooth and easy to extract. šŸž Anchor: Like a snow-depth map: negative means you’re below the snow (inside), positive above (outside), and zero at the top surface.
  • 🄬 Concept (Marching Cubes): An algorithm to turn the SDF into a triangle mesh. How: scan each small cube, find where the surface crosses, stitch triangles; Why: without it, you’d have distances but no visible surface. šŸž Anchor: Tracing a coastline on a terrain map to draw the exact shore. (A small SDF-and-marching-cubes sketch follows this list.)
  • 🄬 Concept (Spatial Localization Decoupling): Separate ā€œwhereā€ (fixed voxel positions) from ā€œwhatā€ (local details). How: condition on voxel indices with RoPE; Why: removes confusion and speeds convergence. šŸž Anchor: Arrange furniture first (where), then choose styles and colors (what) so rooms stay usable and pretty.
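The SDF and marching-cubes building blocks are easy to see end to end on a toy shape. The sketch below uses an analytic sphere SDF and scikit-image's `marching_cubes`; the real system instead decodes the SDF grid from refined latents.

```python
import numpy as np
from skimage import measure  # scikit-image

def sphere_sdf_grid(res: int = 64, radius: float = 0.4) -> np.ndarray:
    """Analytic SDF of a sphere on a regular grid: negative inside, positive outside."""
    lin = np.linspace(-1.0, 1.0, res)
    x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius

sdf = sphere_sdf_grid()
# Marching cubes finds where the field crosses zero and stitches triangles there.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)  # (num_vertices, 3) and (num_triangles, 3)
```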

03 Methodology

At a high level: Input image(s) → Stage 1 (coarse global shape with a DiT on vector-set tokens) → Voxelization & query sampling (with RoPE positions) → Stage 2 (voxel-conditioned DiT refiner with image cross-attention) → Decode to SDF grid → Marching cubes → Refined, watertight mesh.
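In code, the overview reads roughly like the skeleton below. Every component is passed in as a callable, and every name (`coarse_model`, `refiner`, `vae`, `voxelize_near_surface`, and so on) is a placeholder for this sketch, not the actual API.

```python
def generate_shape(image, coarse_model, refiner, vae, image_encoder,
                   voxelize_near_surface, marching_cubes, grid_res=128):
    """Skeleton of the two-stage flow; every argument is a placeholder callable."""
    coarse_latents = coarse_model.sample(image)                  # Stage 1: global structure
    coarse_mesh = vae.decode_to_mesh(coarse_latents)
    coords = voxelize_near_surface(coarse_mesh, grid_res)        # fixed voxel positions (RoPE input)
    image_tokens = image_encoder(image)                          # e.g. DINOv2 patch features
    refined_latents = refiner.sample(coords, cond=image_tokens)  # Stage 2: anchored local detail
    sdf_grid = vae.decode_sdf(refined_latents, coords, grid_res) # off-surface-aware decoding
    return marching_cubes(sdf_grid)                              # final watertight triangle mesh
```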

Step 0: Data curation and watertight processing

  • What happens: Start from Objaverse’s ~800K models. Use a CUDA-parallel sparse voxel method to seal holes, detect open surfaces, and thicken thin sheets; then filter the data. Filtering includes VLM-based removal of trivial/noisy items, pose normalization (including a learned canonicalizer), and geometry checks using a pretrained VAE (models that reconstruct into fragments are removed). End with ~330K valid, ~120K high-quality models.
  • Why it exists: Broken, leaky meshes confuse SDF learning and make refinement unstable. Misposed objects make the model learn contradictory shape priors.
  • Example: A bike with tiny gaps between parts becomes watertight and slightly thickened so spokes don’t vanish; misrotated statues get turned upright so ā€œupā€ is consistent.

🄬 Concept (Watertight Processing Method – recap): Resolve inside/outside before extracting the surface by working in a sparse voxel domain; automatically close holes and thicken open sheets. šŸž Anchor: Think of stuffing foam into all cracks of a cardboard model, then wrapping it so air can’t get in.
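One of the filters described above, the VAE reconstruction check, can be sketched with trimesh: reconstruct the model and reject it if it breaks into fragments. The reconstruction callable and the one-component threshold are assumptions for this sketch.

```python
import trimesh

def passes_geometry_check(mesh_path: str, reconstruct_fn, max_components: int = 1) -> bool:
    """Keep a model only if its VAE reconstruction stays in one piece.

    `reconstruct_fn` stands in for encode -> decode -> marching cubes; the
    one-component threshold is an assumption, not the paper's exact rule.
    """
    mesh = trimesh.load(mesh_path, force="mesh")
    recon = reconstruct_fn(mesh)                       # a trimesh.Trimesh of the reconstruction
    pieces = recon.split(only_watertight=False)        # connected components of the result
    return len(pieces) <= max_components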

Step 1: Render and sample for training

  • What happens: Use Blender Cycles to render 16 images per object (8 near-frontal, 8 random) at high resolution with varied lighting. Sample ~600K surface points for the VAE encoder, and ~1M supervision points (near-surface, curvature-aware, and free space) for SDF training.
  • Why it exists: Diverse views and careful point sampling preserve edges and capture both inside/outside behavior of the SDF.
  • Example: For a teapot, more samples at the spout and handle prevent them from smoothing away.
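A simplified version of the supervision-point sampling might look like this with trimesh: most points are jittered slightly off the surface and a smaller set fills free space. Curvature-aware sampling, which the paper uses to protect sharp edges, is omitted from this sketch.

```python
import numpy as np
import trimesh

def sample_sdf_supervision(mesh: trimesh.Trimesh, n_near: int = 800_000,
                           n_free: int = 200_000, sigma: float = 0.01):
    """Near-surface and free-space supervision points with signed distances.

    Assumes the mesh is normalized to roughly the unit cube; curvature-aware
    sampling (extra points on sharp regions) is intentionally left out.
    """
    surface_pts, _ = trimesh.sample.sample_surface(mesh, n_near)
    near = surface_pts + np.random.randn(n_near, 3) * sigma        # jitter off the surface
    free = np.random.uniform(-1.0, 1.0, size=(n_free, 3))          # uniform free-space points
    pts = np.concatenate([near, free], axis=0)
    # trimesh reports positive distance inside the mesh, so flip to get a negative-inside SDF.
    sdf = -trimesh.proximity.signed_distance(mesh, pts)
    return pts, sdf
```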

Step 2: Stage 1 – Coarse global structure with a vector-set DiT

  • What happens: Adopt Hunyuan3D-2.1’s DiT and VAE as a strong starting point. Generate compact latents capturing the overall shape. Focus on big geometry (silhouette, main parts) rather than tiny details.
  • Why it exists: A reliable coarse prior provides stable anchors for the next stage.
  • Example: For a car, Stage 1 ensures four wheels, the main body, and rough proportions are right.

🄬 Concept (Coarse-to-Fine Generation Pipeline – recap): First the frame (coarse), then the furnishings (fine), so details don’t fight the layout. šŸž Anchor: Frame the house before you choose the doorknobs.

Step 3: Voxelization and query preparation with RoPE

  • What happens: From the coarse mesh, sample voxel queries on a fixed 3D grid (e.g., resolution 128) near the surface and in slightly offset zones. Encode each query’s 3D coordinates using RoPE, attach them to latent tokens.
  • Why it exists: Fixing positions forms a structured, discretized space that’s much easier to refine.
  • Example: For a chair, voxel queries cluster around the legs and seat edges, marking exactly where to sharpen corners.

🄬 Concept (RoPE – recap): Turn XYZ into rotational position signals that a transformer can use reliably at scale. šŸž Anchor: Latitude/longitude for every tiny worksite so crews never get lost.
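A small sketch of how voxel queries near the coarse surface could be gathered: rasterize coarse-mesh vertices into a 128³ grid, dilate to form a thin shell, and return the integer voxel coordinates that RoPE will encode. The dilation band width and the [-1, 1] normalization range are assumptions.

```python
import numpy as np
from scipy import ndimage

def voxel_queries_near_surface(coarse_verts: np.ndarray, res: int = 128, band: int = 1) -> np.ndarray:
    """Integer (x, y, z) voxel coordinates forming a thin shell around the coarse surface.

    Assumes vertices are normalized to [-1, 1]^3; the band width is a guess.
    These coordinates are what RoPE turns into position signals for the refiner.
    """
    idx = np.clip(((coarse_verts + 1.0) * 0.5 * (res - 1)).astype(int), 0, res - 1)
    occ = np.zeros((res, res, res), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True             # voxels the coarse mesh touches
    shell = ndimage.binary_dilation(occ, iterations=band)   # include slightly offset zones
    return np.argwhere(shell)                                # (N, 3) voxel query positions
```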

Step 4: Stage 2 – Voxel-conditioned diffusion refinement with image guidance

  • What happens: A DiT refiner (initialized from Hunyuan3D-2.1) denoises voxel-aligned latent tokens. Spatial info (RoPE) goes into self-attention layers. Image features (from DINOv2) enter via cross-attention. Token masking hides background pixels so the geometry aligns to the foreground object only. Training is progressive: more tokens and higher image resolution over time, stabilizing learning (e.g., 4096→8192→10240 tokens; final inference often uses 32768 tokens).
  • Why it exists: To sculpt crisp edges, thin structures, and local details while obeying the coarse layout.
  • Example: Given a sneaker photo, the refiner adds lace holes, sole grooves, and sharp seams where the coarse mesh only had smooth blobs.

🄬 Concept (Tokens – recap): Small, addressable chunks of info the transformer attends to. šŸž Anchor: Many little sticky notes—each one a precise instruction for one spot.
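The image-token masking step can be sketched as follows: average the foreground alpha inside each DINOv2-style patch and zero out tokens that are mostly background. The 0.5 threshold and the patch-size handling are illustrative assumptions.

```python
import torch

def mask_background_tokens(image_tokens: torch.Tensor, alpha_mask: torch.Tensor,
                           patch_size: int = 14) -> torch.Tensor:
    """Zero out image tokens whose patch is mostly background.

    image_tokens: (B, N, C) patch features in row-major patch order.
    alpha_mask:   (B, H, W) foreground alpha in [0, 1], with H and W divisible by patch_size.
    The 0.5 threshold is an assumption for this sketch.
    """
    B, H, W = alpha_mask.shape
    patch_alpha = alpha_mask.reshape(B, H // patch_size, patch_size,
                                     W // patch_size, patch_size).mean(dim=(2, 4))
    keep = (patch_alpha.flatten(1) > 0.5).float().unsqueeze(-1)    # (B, N, 1) keep/drop flags
    return image_tokens * keep
```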

Step 5: Off-surface VAE decoding to an SDF grid

  • What happens: The VAE is extended to handle slightly off-surface queries by adding small random perturbations during training (e.g., uniform in [āˆ’1/128, 1/128]). At inference, decode the denoised latents into an SDF on a regular grid.
  • Why it exists: Enabling off-surface predictions yields smoother, more reliable fields and better geometry extraction.
  • Example: A helmet’s air vents remain crisp because the decoder understands the volume around the surface, not just the exact surface points.

🄬 Concept (SDF – recap): Distance to surface with a sign for inside/outside, perfect for smooth isosurfaces. šŸž Anchor: Like a contour map that tells you how high or low you are relative to sea level.
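The off-surface extension is simple to express in code: jitter each query point by up to one voxel width, matching the uniform [āˆ’1/128, 1/128] range described above.

```python
import torch

def perturb_queries(query_pts: torch.Tensor, grid_res: int = 128) -> torch.Tensor:
    """Jitter each query point by up to one voxel width: uniform in [-1/grid_res, 1/grid_res]."""
    eps = 1.0 / grid_res
    return query_pts + (torch.rand_like(query_pts) * 2.0 - 1.0) * eps
```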

Step 6: Marching cubes to extract the final mesh

  • What happens: Run marching cubes on the SDF to get a triangle mesh. Output is the refined, watertight geometry.
  • Why it exists: We need a clean, usable surface for downstream tasks like rendering or simulation.
  • Example: The final dragon statue has crisp scales and no leaks.

🄬 Concept (Marching Cubes – recap): Convert invisible distance values into visible triangles. šŸž Anchor: Drawing the shoreline from the tide map to get a clear coast.

Training and scaling details

  • Progressive schedules stabilize learning: VAE fine-tuning (55K steps), DiT refinement in stages with increasing tokens and image resolution. Voxel resolution 128 for train/infer; at test time, more tokens (e.g., 32768) improve detail. Experiments run on 8 NVIDIA H20 GPUs with a batch size of 32 and ~120K filtered samples.
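Read as a configuration, the progressive schedule looks roughly like the snippet below; the token counts and resolutions come from the text, while the phase names and anything not stated are placeholders.

```python
# Token budgets from the text; phase names and unstated per-phase settings are placeholders.
progressive_schedule = [
    {"phase": "refine-1", "tokens": 4096},
    {"phase": "refine-2", "tokens": 8192},
    {"phase": "refine-3", "tokens": 10240},
]
inference_tokens = 32768   # test-time scaling: more tokens than seen during training
voxel_resolution = 128     # used for both training and inference
```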

Secret sauce

  • The key is spatial localization decoupling: fix where; learn what. This reduces the diffusion search space, speeds convergence, and allows test-time scaling. Clean watertight data boosts signal-to-noise, and image token masking keeps geometry aligned to the subject.

🄬 Concept (Spatial Localization Decoupling – recap): Separate position (fixed voxels) from details (learned content). šŸž Anchor: Put the puzzle frame down first; then fill in pieces without re-guessing the edges.

04 Experiments & Results

The test: They evaluated two things—(1) how well the watertight processing cleans and prepares meshes, and (2) how good the generated/refined shapes look compared with strong baselines. They also tested scalability: what happens if you give the model more tokens at test time than it saw during training?

Competition: For watertightening, they compared against Dora-style UDF methods, flood-fill (e.g., ManifoldPlus), and visibility-check approaches. For generation, they compared with open-source systems like Hunyuan3D-2.1, TRELLIS, TRELLIS.2, Hi3DGen, Direct3D-S2, and others; they also showed qualitative comparisons versus commercial tools.

Scoreboard with context:

  • Watertightening: Their method consistently closed large holes without the noisy, bumpy artifacts seen in visibility-check methods and avoided the double-shell problems of flood-fill/UDF heuristics. Think of this as getting an A for both cleanliness and completeness, while others alternated between neat-but-incomplete or complete-but-noisy (more like B- to C+).
  • Geometry generation: Results show sharper edges, cleaner thin parts, and better match to the input image’s silhouette and features. Compared to open-source baselines, UltraShape 1.0 looks like the student who not only answers correctly but also shows excellent handwriting and diagrams—overall an A, while many peers are around B.
  • Test-time scaling: Even though trained with a moderate token budget, the model generalizes to many more tokens during inference, producing visibly richer, crisper geometry. That’s like studying for a 10-question test but acing a 20-question version using the same knowledge.
  • VAE reconstruction scaling: Increasing tokens at inference improved reconstructions similarly—evidence that the representation and decoder gracefully scale.

Surprising findings:

  • Training-free stylization works: If the first stage uses image A (to set the coarse shape) and the second stage uses image B (to add fine style), the final object keeps A’s structure while borrowing B’s details—without retraining. This suggests the spatial anchors are robust and the detail-synthesis channel is flexible.
  • Sensitivity to RGBA quality: Poor masks or background leakage in the conditioning image degraded geometry, underlining how important good pre-processing is.

Takeaway: With only public data and modest compute, UltraShape delivers geometry quality comparable to commercial systems in many cases and surpasses several open-source methods. The consistent improvement from more tokens shows the approach is meaningfully scalable.

05 Discussion & Limitations

Limitations:

  • Input dependence: If the conditioning image has bad segmentation (leftover background or shadows), the refiner may bake those mistakes into the geometry.
  • Resolution ceiling: Working at voxel resolution 128 balances quality and memory, but extremely tiny features or filigree might still be challenging.
  • Data scope: Training on public datasets like Objaverse is great for openness but may miss certain industrial CAD styles or niche categories.
  • Compute and memory: While efficient for its quality, pushing to very high token counts or resolutions still needs strong GPUs.
  • Geometry only: The report emphasizes geometry; high-fidelity materials/textures are outside the main contribution.

Required resources:

  • A curated, watertight dataset; Blender for rendering multi-view images.
  • GPUs (they used 8Ɨ NVIDIA H20) for fine-tuning and inference at higher token counts.
  • The CUDA-parallel watertight pipeline, VLM for filtering, and a pose canonicalization model.

When not to use:

  • Real-time mobile scenarios with tiny memory budgets.
  • Dynamic/deformable scenes where objects move over time (pipeline is for static assets).
  • Medical/engineering cases needing strict metrology-level accuracy or guaranteed manifold topology beyond the tested regime.

Open questions:

  • How far can voxel resolution and token counts scale before diminishing returns or instability?
  • Can end-to-end training bridge the two stages even more tightly without losing the decoupling benefits?
  • How to integrate materials (PBR), textures, and lighting consistency alongside geometry at this fidelity?
  • Better robustness to imperfect image masks—can we learn to ignore backgrounds automatically?
  • How well does the method generalize to out-of-distribution categories and CAD-like precision parts?

06 Conclusion & Future Work

3-sentence summary: UltraShape 1.0 is a two-stage 3D generation system that first makes a strong global shape and then refines fine details at fixed voxel locations. A robust watertight-and-filter data pipeline cleans the training set, while RoPE-anchored voxel queries let the refiner focus on details instead of re-solving positions. The result is clean, high-fidelity geometry that scales with more tokens—even using only public data and modest compute.

Main achievement: Decoupling ā€œwhereā€ from ā€œwhatā€ in 3D diffusion by conditioning refinement on voxel queries with explicit positional encoding, backed by a watertight, high-quality dataset.

Future directions: Push voxel resolution and token counts further; jointly handle textures/materials; improve robustness to noisy image conditioning; explore end-to-end training and better stylization controls.

Why remember this: It shows a practical path to scalable, production-ready 3D geometry generation—clean data in, coarse shape first, then precise detail—an approach that’s both effective and efficient, and that plays nicely with limited training resources.

Practical Applications

  • Rapid concept modeling for games and films: block out shapes, then refine to production-ready geometry.
  • AR/VR product previews that need crisp, watertight meshes for accurate placement and lighting.
  • Robot simulation assets with clean interiors for reliable collision and grasp planning.
  • 3D printing models that require sealed, manifold surfaces to avoid failed prints.
  • Industrial design iterations where coarse forms are refined into detailed prototypes quickly.
  • E-commerce 3D catalogs with consistent poses and high-quality geometry from varied sources.
  • Education tools that convert images into study-ready 3D models for science and engineering classes.
  • Heritage digitization cleanup: repair and refine scanned artifacts to robust, sharable meshes.
  • Pre-visualization in architecture: fast massing (coarse) followed by faƧade detailing (fine).
  • Style transfer for 3D: keep structure from one reference image and apply detailed style from another without retraining.
#3D diffusion #coarse-to-fine generation #voxel-based refinement #watertight remeshing #signed distance field #marching cubes #rotary positional encoding #diffusion transformer #vector-set representation #data curation pipeline #pose canonicalization #test-time scaling #image-conditioned 3D generation #Objaverse #VAE fine-tuning
Version: 1