HY3D-Bench: Generation of 3D Assets
Key Summary
- HY3D-Bench is a complete, open-source “starter kit” for making and studying high-quality 3D objects.
- It cleans and standardizes 252k real 3D assets so they are training-ready, including watertight meshes and multi-view pictures.
- It adds 240k objects that are neatly split into parts, so models (and people) can control and edit pieces like wheels, doors, or handles.
- It synthesizes 125k extra 3D assets using AI to cover rare, long-tail categories you almost never find in real datasets.
- It supplies a clear benchmark: fixed train/val/test splits, standard metrics, baselines, and model weights for fair comparisons.
- Its data pipeline fixes messy geometry, aligns orientations, makes meshes watertight, and samples points smartly for better learning.
- The part pipeline uses connected components, merges tiny fragments, renders RGB and part-ID masks, and watertightens parts and wholes.
- The synthesis pipeline is Text-to-Text → Text-to-Image (with LoRA fine-tuning) → Image-to-3D using HY3D-3.0 for clean, useful training data.
- A smaller Hunyuan3D-2.1-Small model trained on HY3D-Bench matches larger systems on standard image-to-3D metrics, proving data quality matters.
- This ecosystem lowers the barrier for 3D research and boosts applications in robotics, AR/VR, games, and digital content creation.
Why This Research Matters
Clean, ready-to-train 3D data speeds up research and helps smaller teams compete without huge preprocessing budgets. Part-level structure unlocks precise editing tools, smarter robotic manipulation, and educational interfaces where students explore how objects are built. Long-tail synthetic assets mean models can handle unusual objects found in homes, hospitals, factories, and stores. Fair benchmarks with fixed splits and metrics make results trustworthy, so progress is real and reproducible. Altogether, HY3D-Bench helps bring safer robots, richer AR/VR, faster game production, and more accessible 3D creation to everyday life.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine cleaning your room after a big birthday party: toys are everywhere, some are broken, and pieces from different sets are mixed in one box. It’s hard to play until you sort, fix, and label everything.
🥬 Filling (The Actual Concept): 3D perception is a computer’s ability to understand shapes, sizes, and positions of things in 3D. How it works:
- Look at 3D data (meshes, point clouds, images from many views).
- Find patterns that describe surfaces, edges, and volumes.
- Decide what the object is and where its parts are. Why it matters: Without solid 3D perception, a robot can’t pick up objects safely and a generator can’t build realistic shapes. 🍞 Bottom Bread (Anchor): A warehouse robot recognizing the difference between a mug’s handle and its body so it grips the right spot.
🍞 Top Bread (Hook): You know how a kid who reads tons of books can write new stories in that style?
🥬 Filling: Generative models are algorithms that learn from examples and then create new, similar things. How it works:
- Study many examples (here: lots of 3D objects).
- Learn a hidden “recipe” (a latent space) for making them.
- Use the recipe to produce new shapes on demand. Why it matters: Without good generative models, making fresh, realistic 3D assets is slow and manual. 🍞 Bottom Bread: A game engine asking the model to create a new chair that fits a “cozy cabin” vibe.
🍞 Top Bread (Hook): Think of a water bottle: if there’s a hole, it leaks.
🥬 Filling: Watertight meshes are 3D models with no holes or gaps—fully “sealed.” How it works:
- Detect surface gaps and self-intersections.
- Build a clean, closed surface.
- Verify inside/outside is consistent. Why it matters: Non-watertight models break physics, 3D printing, and reliable training signals. 🍞 Bottom Bread: A robot simulation needs a leak-proof mug mesh so liquid dynamics make sense.
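For readers who want to poke at this in code, here is a minimal sketch of checking (and lightly repairing) watertightness with the open-source trimesh library; the file path and the simple hole-filling repair are illustrative assumptions, not the dataset's actual pipeline:

```python
import trimesh

# Load a mesh (the path is a hypothetical example).
mesh = trimesh.load("mug.obj", force="mesh")

# A watertight mesh has no boundary edges and a consistent inside/outside.
print("watertight?", mesh.is_watertight)

if not mesh.is_watertight:
    # Lightweight repairs: fix face winding/normals and fill small holes.
    trimesh.repair.fix_normals(mesh)
    trimesh.repair.fill_holes(mesh)
    print("after simple repair:", mesh.is_watertight)

# Only a sealed mesh has a well-defined enclosed volume.
if mesh.is_watertight:
    print("enclosed volume:", mesh.volume)
```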
🍞 Top Bread (Hook): When you buy shoes online, you check photos from different angles.
🥬 Filling: Multi-view renderings are pictures of the same 3D object from many camera angles. How it works:
- Place virtual cameras around the object.
- Render images under set lighting and camera poses.
- Store the images and camera info together. Why it matters: Without multi-view images, models can’t learn consistent 3D from 2D views. 🍞 Bottom Bread: Training a model to rebuild a toy car after seeing it from the front, side, and top.
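A small sketch of the camera-placement idea: put virtual cameras on a ring around the object and build look-at poses with NumPy. The view count, radius, and height are made-up values; the paper's renders come from Blender with its own camera setup:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 camera-to-world pose looking from `eye` toward `target`."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, -forward
    pose[:3, 3] = eye
    return pose

# 24 cameras on a circle around the object, slightly elevated.
n_views, radius, height = 24, 2.5, 0.8
poses = []
for i in range(n_views):
    angle = 2 * np.pi * i / n_views
    eye = np.array([radius * np.cos(angle), height, radius * np.sin(angle)])
    poses.append(look_at(eye))

# Each pose would be stored next to its rendered image and the camera intrinsics.
print(len(poses), poses[0].shape)
```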
🍞 Top Bread (Hook): In a snack shop, there are tons of chips but only a few packs of seaweed crackers.
🥬 Filling: Long-tail distribution means a few categories are common, while many others are rare. How it works:
- Count items per category.
- Notice most items clump in a few, leaving many with tiny counts.
- Fill the gaps or models won’t generalize. Why it matters: If rare categories are missing, robots and generators fail on unusual but important objects. 🍞 Bottom Bread: A household robot must handle a rare-shaped vase even if it trained mostly on bowls.
The world before: Big 3D collections like Objaverse gave researchers millions of assets. But like that messy post-party room, the data was inconsistent: different coordinate systems (Y-up vs Z-up), non-manifold geometry, broken or missing textures, and no clear part structure. People had to spend huge amounts of time and computing power just to make the data usable. Beginners got stuck before they even trained a model.
The problem: Models for 3D generation and perception need neat, consistent inputs (watertight meshes, clean multi-view images, and sometimes signed distance fields). They also need structure at the part level to enable fine editing and robotics tasks. And they need rare categories (the long tail) to work reliably in the real world. Most public data didn’t check these boxes.
Failed attempts: Prior efforts filtered for quality, fixed orientations, or added multi-modal samples, but usually stopped short of a fully training-ready package. Researchers still had to do heavy lifting: watertightening, point sampling, rendering for many views, and building fair evaluation splits.
The gap: A single, unified ecosystem that delivers (1) clean, watertight, and rendered training-ready assets; (2) part-level decomposition with both whole and per-part meshes and masks; and (3) synthetic, balanced coverage for rare categories—plus a standard benchmark with fixed protocols.
Real stakes: With clean, structured 3D, we can train better models faster—leading to safer robots, richer AR/VR, quicker game asset pipelines, and even easier 3D printing. HY3D-Bench aims to turn the messy toy box into a neatly labeled toolkit that anyone can use.
02 Core Idea
🍞 Top Bread (Hook): Imagine a school art room where every brush, paint, and canvas is labeled and ready, and there’s a gallery wall to fairly judge everyone’s work.
🥬 Filling: HY3D-Bench is a unified, open-source ecosystem that makes 3D data training-ready, adds part structure, expands rare categories with synthetic assets, and standardizes evaluation. How it works:
- Clean and prepare 252k real assets: watertight meshes, multi-view renderings, consistent formats.
- Add 240k part-level decompositions: whole-object and per-part watertight meshes, plus RGB and part-ID masks.
- Synthesize 125k assets for long-tail coverage with Text→Image→3D.
- Provide fixed splits, metrics, baselines, and weights for fair comparisons. Why it matters: Without this, researchers waste time on data wrangling, get unfair benchmarks, and miss rare categories. 🍞 Bottom Bread: A student downloads HY3D-Bench and trains a new 3D model in days instead of months because the data “just works.”
The “Aha!” Moment (one sentence): If we standardize quality, structure, and diversity in one place—and lock in fair testing rules—3D generation research speeds up for everyone.
Multiple analogies:
- Kitchen: Prepped ingredients (clean meshes), recipe steps (renderings and masks), spice rack for rare flavors (synthetic data), and a taste test with rules (benchmark).
- Lego set: Bricks sorted by shape/color (parts), instruction booklet (protocols), special rare pieces included (long-tail), and a build contest with the same rules.
- Library: Books cleaned and cataloged (assets), chapters marked (parts), rare books added (synthetic), and a reading test that’s the same for all (benchmark).
Before vs After:
- Before: Each lab repeated the same painful mesh fixes, renders, and part splits, and compared results on different data with different metrics.
- After: Everyone starts from the same, clean base with parts and rare categories, trains with known configs, and compares apples-to-apples.
🍞 Top Bread (Hook): You know how taking apart a toy lets you repaint just the wheels without touching the body?
🥬 Filling: Structured part-level decomposition splits objects into meaningful parts you can control separately. How it works:
- Split by connectivity into initial components.
- Merge tiny fragments to reasonable part sizes.
- Keep per-part and whole meshes watertight; render RGB and part-ID masks. Why it matters: Without parts, it’s hard to edit, understand, or train robots to manipulate specific components. 🍞 Bottom Bread: A designer swaps a chair’s legs while keeping the seat and backrest unchanged.
🍞 Top Bread (Hook): Think of practice worksheets before an exam.
🥬 Filling: Synthetic data generation creates realistic “practice” 3D assets to fill category gaps. How it works:
- Text-to-Text (LLM) expands category descriptions.
- Text-to-Image (with LoRA) produces clean, centered views.
- Image-to-3D reconstructs detailed meshes. Why it matters: Without synthetic data, rare objects stay underrepresented, and models overfit to common categories. 🍞 Bottom Bread: Generating unusual medical carts or niche tools so a service robot isn’t surprised at a hospital.
🍞 Top Bread (Hook): Imagine a chef who invents new dishes from a list of desired flavors and a photo.
🥬 Filling: The AIGC synthesis pipeline is a three-step AI system (Text→Image→3D) for making diverse, useful 3D objects. How it works:
- Use an LLM to craft rich, accurate, varied descriptions.
- Fine-tune a text-to-image model (LoRA) for clean backgrounds and good angles.
- Convert images to high-quality 3D with HY3D-3.0. Why it matters: Without a controlled pipeline, images come with messy backgrounds and bad views that ruin 3D reconstruction. 🍞 Bottom Bread: A clean, three-quarter-view image of a kettle becomes a faithful 3D kettle model.
Why it works (intuition):
- Consistency: Standard formats and watertightness make training signals stable.
- Structure: Parts unlock fine-grained control and learning.
- Diversity: Synthetic long-tail assets stop models from being “chair-only smart.”
- Fairness: Fixed splits and metrics make results trustworthy and comparable.
Building blocks:
- Full-level processing (render/convert, filter, watertight + sample)
- Part-level processing (split/merge/filter, render RGB + mask, watertight per-part)
- Synthesis (Text→Image→3D with LoRA cues)
- Benchmark (splits, ULIP/Uni3D metrics, baselines)
03 Methodology
At a high level: Raw 3D assets + category list → [Full-level processing] → [Part-level processing] → [Synthetic pipeline] → Training-ready datasets + benchmark.
Full-level processing (for whole objects):
- What happens: Standardize formats/orientation, render multi-view images, filter low-quality data, make meshes watertight, sample points.
- Why this step exists: Messy inputs cause unstable training; clean, watertight shapes and rich views provide reliable supervision.
- Example: A guitar model with flipped axes and broken texture is fixed to Y-up, re-rendered from many views, watertightened, and sampled for learning.
Details (like a recipe):
- Data rendering and conversion
- What: Convert diverse assets to single-frame static meshes (PLY), align orientation, and render multi-view images in Blender (orthographic + perspective).
- Why: Consistent formats stop downstream code from breaking; multiple views teach models 3D consistency.
- Example: A Z-up Blender scene is rotated to Y-up, then rendered from 24 views.
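A hedged sketch of that conversion step using trimesh (assuming a recent trimesh version): load an arbitrary asset, flatten it to one static mesh, rotate Z-up to Y-up, normalize, and export PLY. Filenames and the normalization convention are assumptions:

```python
import numpy as np
import trimesh

# Load an arbitrary asset and flatten any scene graph into a single mesh.
loaded = trimesh.load("asset.glb")
mesh = loaded.dump(concatenate=True) if isinstance(loaded, trimesh.Scene) else loaded

# Rotate a Z-up asset to Y-up: -90 degrees about the X axis.
z_up_to_y_up = trimesh.transformations.rotation_matrix(
    angle=-np.pi / 2, direction=[1, 0, 0], point=[0, 0, 0]
)
mesh.apply_transform(z_up_to_y_up)

# Center the object and scale it into a unit box (a common convention).
mesh.apply_translation(-mesh.bounding_box.centroid)
mesh.apply_scale(1.0 / mesh.extents.max())

mesh.export("asset.ply")  # single-frame static mesh, ready for multi-view rendering
```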
- Asset filtering
- What: Remove low-poly, texture-broken, and ultra-thin-structure objects.
- Why: Sparse geometry and bad textures confuse learning; thin sheets cause SDF “sign flips” and unstable training.
- Example: A mesh with 200 faces and missing UVs is filtered out; a detailed, well-textured one stays.
- Post-processing: watertight + sampling
🍞 Top Bread (Hook): Think of tracing the distance from any point in space to the closest surface, like measuring how far your finger is from a balloon.
🥬 Filling: A Signed Distance Function (SDF) gives a number at each 3D spot: negative inside, positive outside, zero on the surface. How it works:
- Define a grid of 3D points.
- Compute distance to the nearest surface, with a sign for inside/outside.
- The zero set outlines the object. Why it matters: SDFs let models learn clean shapes and extract meshes. 🍞 Bottom Bread: Turning an SDF of a teddy bear into a smooth mesh you can render and print.
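A toy sketch of an SDF evaluated on a regular grid, using an analytic sphere so the inside/outside sign is easy to verify (much smaller than the dataset's 512³ grids):

```python
import numpy as np

# A regular grid of 3D points covering [-1, 1]^3.
res = 64
axis = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
points = np.stack([x, y, z], axis=-1)            # shape (res, res, res, 3)

# Signed distance to a sphere of radius 0.5 at the origin:
# negative inside, positive outside, zero exactly on the surface.
radius = 0.5
sdf = np.linalg.norm(points, axis=-1) - radius   # shape (res, res, res)

print("fraction of grid points inside:", (sdf < 0).mean())
```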
- What: Compute an unsigned distance field (UDF) on a 512³ grid, extract a thin shell via Marching Cubes (ε = 1/512), tetrahedralize with Delaunay, then use graph cut labeling (inner/outer) to extract a watertight boundary (ConvexMeshing-style). Finally, sample points by mixing uniform-surface and edge-importance strategies.
- Why: Marching Cubes plus volumetric labeling seals holes; hybrid sampling captures both global shape and sharp features.
- Example: A teapot spout with tiny gaps becomes sealed; sampling focuses more points along the spout’s rim for detail.
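Below is a hedged sketch of the hybrid sampling idea using trimesh: mix uniform surface samples with extra samples concentrated along sharp edges. The sharpness threshold, total counts, and per-edge sample count are assumptions, not the paper's settings:

```python
import numpy as np
import trimesh

mesh = trimesh.load("teapot_watertight.ply", force="mesh")

# 1) Uniform samples spread evenly over the surface (global shape coverage).
uniform_pts, _ = trimesh.sample.sample_surface(mesh, 30000)

# 2) Edge-importance samples: find edges whose adjacent faces meet at a sharp
#    dihedral angle, then drop extra points along those edges (fine detail).
sharp = mesh.face_adjacency_angles > np.radians(30)        # assumed threshold
edges = mesh.face_adjacency_edges[sharp]                   # (k, 2) vertex indices
v0, v1 = mesh.vertices[edges[:, 0]], mesh.vertices[edges[:, 1]]
samples_per_edge = 4
idx = np.repeat(np.arange(len(edges)), samples_per_edge)
t = np.random.rand(len(idx))[:, None]
edge_pts = v0[idx] + t * (v1[idx] - v0[idx])               # lerp along each edge

points = np.concatenate([uniform_pts, edge_pts], axis=0)
print(points.shape)
```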
🍞 Top Bread (Hook): Like connecting dots to reveal a smooth statue inside a block.
🥬 Filling: Marching Cubes is a method to turn a 3D grid of distances into a triangle mesh. How it works:
- Look at each little cube of the grid.
- Check where the surface crosses its corners.
- Place triangles to trace the zero boundary. Why it matters: Without it, you can’t turn distance values into usable surfaces. 🍞 Bottom Bread: Extracting the shell of a vase from its distance field.
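A minimal sketch of extracting a surface from a distance grid with scikit-image's Marching Cubes, reusing the toy sphere SDF from above (the paper's pipeline runs it on an unsigned field to get a thin shell):

```python
import numpy as np
from skimage import measure

# Reuse a distance grid like the toy sphere SDF sketched earlier.
res = 64
axis = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.linalg.norm(np.stack([x, y, z], axis=-1), axis=-1) - 0.5

# Marching Cubes traces the zero level set with triangles, cube by cube.
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)

# `verts` are in grid-index coordinates; map them back to [-1, 1].
verts = verts / (res - 1) * 2.0 - 1.0
print(verts.shape, faces.shape)
```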
🍞 Top Bread (Hook): Imagine stretching a net of triangles to cover a set of points without overlaps.
🥬 Filling: Delaunay triangulation builds tetrahedra that nicely fill space between points. How it works:
- Take sampled points on the shell.
- Create tetrahedra that avoid skinny shapes.
- Form a volume mesh for labeling inside vs outside. Why it matters: Good tetrahedra make later optimization stable. 🍞 Bottom Bread: Making a sturdy 3D net around a toy car so you can classify inner/outer cells.
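A small sketch of Delaunay tetrahedralization with SciPy: each simplex is a tetrahedron that can later be labeled inside or outside. The random point cloud is a placeholder standing in for shell samples:

```python
import numpy as np
from scipy.spatial import Delaunay

# Placeholder point cloud standing in for vertices of the extracted shell.
shell_points = np.random.rand(2000, 3)

tets = Delaunay(shell_points)
print("tetrahedra:", tets.simplices.shape)   # (n_tets, 4) vertex indices each
print("neighbors:", tets.neighbors.shape)    # adjacent tetrahedron across each face

# A neighbor index of -1 marks a face on the outer convex hull, which is handy
# when wiring these cells into the inside/outside labeling graph that follows.
```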
🍞 Top Bread (Hook): Think of coloring regions inside a maze with two labels: “in” or “out,” minimizing messy borders.
🥬 Filling: Graph cut optimization assigns inside/outside labels to tetrahedra to find a clean, watertight surface. How it works:
- Build a graph where nodes are cells and edges connect neighbors.
- Define costs for being inside or outside.
- Cut the graph to minimize total cost, yielding a sealed boundary. Why it matters: Without it, tiny leaks remain and the mesh isn’t truly watertight. 🍞 Bottom Bread: A leaky mug mesh becomes a sealed, printable mug.
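A hedged sketch of the inside/outside labeling posed as a minimum s–t cut, here with NetworkX and toy costs; the real pipeline derives its costs from the distance field, and this is only meant to show the mechanics:

```python
import networkx as nx

# Toy example: 4 cells with unary costs and pairwise smoothness weights.
cells = [0, 1, 2, 3]
cost_inside = {0: 0.1, 1: 0.2, 2: 0.9, 3: 0.8}            # low = likely inside
cost_outside = {c: 1.0 - cost_inside[c] for c in cells}
neighbors = [(0, 1, 0.5), (1, 2, 0.5), (2, 3, 0.5)]        # (cell, cell, weight)

G = nx.DiGraph()
for c in cells:
    # Cutting SOURCE->c puts c on the sink side, i.e. pays the "outside" cost;
    # cutting c->SINK keeps c on the source side, i.e. pays the "inside" cost.
    G.add_edge("SOURCE", c, capacity=cost_outside[c])
    G.add_edge(c, "SINK", capacity=cost_inside[c])
for a, b, w in neighbors:
    G.add_edge(a, b, capacity=w)   # penalty for separating neighboring cells
    G.add_edge(b, a, capacity=w)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "SOURCE", "SINK")
inside = source_side - {"SOURCE"}  # cells that stay with the source are "inside"
print("inside cells:", inside, "| cut cost:", cut_value)
```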
Part-level processing (for components):
- What happens: Split by connectivity, merge tiny fragments, verify balanced parts, render RGB and part-ID masks, watertight whole and per-part.
- Why this step exists: Parts enable fine-grained control, editing, and robotics grasps.
- Example: A bike splits into frame, wheels, seat, and handlebar, each watertight and mask-rendered.
🍞 Top Bread (Hook): If two Lego bricks aren’t glued, you can separate them; if they’re fused, they’re one piece.
🥬 Filling: Connected Component Analysis finds chunks of a mesh that are actually separate pieces. How it works:
- Treat the mesh as a graph of connected faces.
- Group faces that are linked.
- Each group is a candidate part. Why it matters: Without it, you either get one big blob or too many tiny fragments. 🍞 Bottom Bread: Automatically isolating a chair’s four legs from the seat-back assembly.
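A minimal sketch of connectivity-based splitting with trimesh; the tiny-fragment check at the end only illustrates the merging idea, and the 1% area cutoff is an assumption:

```python
import trimesh

mesh = trimesh.load("bike.glb", force="mesh")

# Group faces that are topologically connected into separate components.
parts = mesh.split(only_watertight=False)
print("candidate parts:", len(parts))

# Fragments far below a size threshold would be merged into a nearby larger
# part rather than kept standalone; the 1% area cutoff here is an assumption.
total_area = sum(p.area for p in parts)
tiny = [p for p in parts if p.area < 0.01 * total_area]
print("fragments to merge:", len(tiny))
```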
- Filters: Remove assets with too few parts (≤1) or too many (>50), reject extreme imbalance (one part >85% area), and drop cluttered tiny fragments (see the sketch after this list).
- Rendering: Produce multi-view RGB and synchronized part-ID masks.
- Watertightening: Seal both whole-object and per-part meshes.
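A hedged sketch of the filter rules above, written as a simple predicate over per-part surface areas (the thresholds follow the text: more than one part, at most 50, and no single part above 85% of the total area):

```python
def keep_asset(part_areas, min_parts=2, max_parts=50, max_area_ratio=0.85):
    """Return True if a decomposed asset passes the part-level filters.

    part_areas: surface area of each candidate part after merging fragments.
    """
    n = len(part_areas)
    if n < min_parts or n > max_parts:            # too few (<=1) or too many (>50)
        return False
    total = sum(part_areas)
    if total <= 0:
        return False
    if max(part_areas) / total > max_area_ratio:  # one part dominates (>85% of area)
        return False
    return True

print(keep_asset([5.0, 1.0, 1.0]))    # True: three reasonably balanced parts
print(keep_asset([98.0, 1.0, 1.0]))   # False: a single dominant part
```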
Synthetic data generation (AIGC):
- What happens: Expand category texts (LLM), generate clean images (LoRA-tuned T2I), then reconstruct 3D (HY3D-3.0).
- Why this step exists: Real datasets miss many rare, useful categories; synthesis fills the long tail.
- Example: From “foldable camping stove” description → clean 3/4-view image → detailed 3D stove.
🍞 Top Bread (Hook): Like asking an art assistant to focus on the foreground and ignore background clutter.
🥬 Filling: LoRA fine-tuning lightly adapts a text-to-image model to produce centered objects with simple backgrounds and good angles. How it works:
- Keep the big model mostly frozen.
- Train small adapter layers on curated prompts.
- Nudge outputs toward clean, training-friendly views. Why it matters: Without LoRA, you get messy scenes that break 3D reconstruction. 🍞 Bottom Bread: Generating a clean, centered toaster on a plain backdrop.
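A minimal PyTorch sketch of the LoRA idea itself: freeze a base linear layer and learn a small low-rank update on top. This is a generic illustration, not the specific text-to-image fine-tuning setup used here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (A @ B)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the big model frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; only A and B get gradients.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```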
Models trained on HY3D-Bench:
🍞 Top Bread (Hook): Imagine zipping a big 3D object into a small, smart suitcase you can later unpack.
🥬 Filling: A 3D Variational Autoencoder (3D VAE) compresses shapes into a compact latent set and decodes them back. How it works:
- Encode sampled points (with normals) into latent vectors (a VecSet).
- Sample from a learned distribution.
- Decode to an SDF and extract the mesh. Why it matters: Without a VAE, generation can be slow or unstable. 🍞 Bottom Bread: Packing a motorcycle into a few hundred numbers, then reconstructing it as a mesh.
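A heavily simplified PyTorch sketch of the shape-VAE idea: encode sampled points (with normals) into a latent code, then decode an SDF value at any query location. It is a toy stand-in, not the VecSet architecture described above:

```python
import torch
import torch.nn as nn

class TinyShapeVAE(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 256))
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        # Decoder maps (latent code, query xyz) -> signed distance at that point.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 3, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, points_with_normals, queries):
        feats = self.encoder(points_with_normals).max(dim=1).values  # pool over points
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()         # reparameterize
        z_rep = z.unsqueeze(1).expand(-1, queries.shape[1], -1)
        sdf = self.decoder(torch.cat([z_rep, queries], dim=-1))
        return sdf, mu, logvar

vae = TinyShapeVAE()
pts = torch.randn(2, 1024, 6)    # xyz + normal for each sampled surface point
qry = torch.randn(2, 512, 3)     # query locations where the SDF is supervised
sdf, mu, logvar = vae(pts, qry)
print(sdf.shape)                 # torch.Size([2, 512, 1])
```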
🍞 Top Bread (Hook): Picture guiding a crumpled paper back to its neat, original sheet step-by-step.
🥬 Filling: Diffusion/flow-matching models learn to reverse noise into data, conditioned on image features, to generate 3D latents. How it works:
- Start from noise in latent space.
- Predict the velocity or noise to move toward real data.
- Use image embeddings (e.g., DINOv2) to guide the shape. Why it matters: Without this, turning an image into 3D latents is much harder. 🍞 Bottom Bread: From a photo of a lamp, the model flows noise into a clean lamp latent, then decodes the mesh.
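A compact PyTorch sketch of one flow-matching training step: interpolate between noise and a shape latent, and regress the straight-line velocity from noise to data, conditioned on an image embedding. The tiny network and feature sizes are assumptions:

```python
import torch
import torch.nn as nn

# Toy velocity network: (noisy latent, time, image embedding) -> predicted velocity.
velocity_net = nn.Sequential(
    nn.Linear(256 + 1 + 512, 512), nn.ReLU(), nn.Linear(512, 256)
)

def flow_matching_loss(latents, image_emb):
    """One rectified-flow-style training step."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], 1)       # random time in [0, 1]
    x_t = (1 - t) * noise + t * latents       # straight-line interpolation
    target_velocity = latents - noise         # constant velocity along that line
    pred = velocity_net(torch.cat([x_t, t, image_emb], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

latents = torch.randn(8, 256)     # shape latents, e.g. from a 3D VAE
image_emb = torch.randn(8, 512)   # e.g. pooled image features of the input photo
loss = flow_matching_loss(latents, image_emb)
loss.backward()
print(float(loss))
```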
Secret sauce:
- End-to-end readiness: Assets come pre-watertightened, rendered, and sampled.
- Part-first design: Per-part meshes and masks enable controllable, fine-grained tasks.
- Long-tail coverage: Synthetic pipeline fills gaps in rare categories.
- Standard benchmark: Shared splits/metrics make results directly comparable.
04 Experiments & Results
The test: The team trained a smaller Hunyuan3D-2.1-Small (832M parameters) using HY3D-Bench’s full-level data to see if clean, standardized inputs boost image-to-3D quality. They measured how well generated meshes match input images using Uni3D and ULIP—two alignment metrics that check visual-language-3D consistency.
🍞 Top Bread (Hook): Think of a teacher grading whether your drawing really matches the photo you copied.
🥬 Filling: ULIP and Uni3D are alignment metrics that score how closely a 3D result corresponds to an image (and text) description. How it works:
- Encode the image and the 3D object into a shared feature space.
- Measure similarity scores.
- Higher is better alignment. Why it matters: Without alignment metrics, we can’t quantify if the generated 3D truly matches the input image. 🍞 Bottom Bread: A high ULIP score for a “red kettle” means the mesh looks like the red kettle photo, not a blue teapot.
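A schematic sketch of how such an alignment score is computed: embed the image and the generated 3D object with paired encoders and take the cosine similarity. The random embeddings below are placeholders standing in for actual ULIP/Uni3D encoder outputs:

```python
import torch
import torch.nn.functional as F

def alignment_score(image_emb: torch.Tensor, shape_emb: torch.Tensor) -> float:
    """Mean cosine similarity between paired image and 3D embeddings (higher = better)."""
    image_emb = F.normalize(image_emb, dim=-1)
    shape_emb = F.normalize(shape_emb, dim=-1)
    return float((image_emb * shape_emb).sum(dim=-1).mean())

# Placeholder embeddings; in practice these come from the ULIP / Uni3D image
# and point-cloud encoders run on the input images and the generated meshes.
image_emb = torch.randn(400, 1024)   # one row per test image
shape_emb = torch.randn(400, 1024)   # one row per generated object
print("alignment:", alignment_score(image_emb, shape_emb))
```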
The competition: They compared against strong open-source methods—Michelangelo, Craftsman, Trellis, and the larger Hunyuan3D 2.1—on a fixed test set (400 objects) with the same rules and data.
Scoreboard (with context):
- Trellis and full Hunyuan3D 2.1 are large models at the top of the field. The small model trained on HY3D-Bench reached nearly the same Uni3D and ULIP scores as these big models, and beat the similarly sized Craftsman. Think of this like getting a solid A when the class stars are A+, and scoring higher than peers your size—a strong sign the training data is excellent.
- Numbers in the paper show the small model at 0.3606 Uni3D-I and 0.2424 ULIP-I versus 0.3636/0.2446 for the large Hunyuan3D 2.1 and 0.3641/0.2454 for Trellis. Michelangelo trails more, while Craftsman is close but lower. The take-away: better data narrows the gap with bigger models.
Training recipe highlights:
- Progressive token schedule (512 → 4096 tokens) with staged batch sizes and image resolutions helped stabilize learning and improved fidelity over time.
- Architecture simplification (no MoE, smaller channels) cut compute but, thanks to high-quality data, preserved competitive results.
Surprising findings:
- Data quality mattered more than raw parameter count in closing the performance gap.
- Excluding extremely thin structures, though conservative, improved overall stability and final scores.
- Clean, LoRA-guided images in the synthetic pipeline noticeably improved downstream 3D reconstructions, underscoring how upstream choices ripple into final metrics.
Qualitative results: Visual comparisons show sharper edges, fewer artifacts, and strong faithfulness to input images. The watertight, standardized meshes also render well from new views, showing learned 3D consistency.
Big picture: HY3D-Bench’s careful curation and standardization let a smaller model punch above its weight. This suggests future gains may come as much from better data ecosystems as from bigger models.
05 Discussion & Limitations
Limitations:
- Scope gaps: The dataset focuses on static objects; dynamic or articulated motion is future work. Materials are represented via renderings but full PBR/material ground truth for every asset is not always included.
- Biases: Even with 125k synthetic items, some real-world long-tail categories may still be under-covered or stylized.
- Geometry constraints: Very thin structures were filtered out to improve stability, which may limit training for ultra-fine geometry.
- Part semantics: Parts are consistent by connectivity and merging rules, but universal semantic names or hierarchies (e.g., “left door vs right door”) aren’t guaranteed everywhere.
Required resources:
- Storage: Hundreds of thousands of meshes, views, and masks require sizable disk space.
- Compute: While training-ready, large-scale training (e.g., 3D VAEs or diffusion) still needs GPUs and time; however, users skip the heavy preprocessing step.
- Tooling: Familiarity with standard 3D tools (e.g., Blender) and DL frameworks is useful for custom experiments.
When NOT to use:
- If your task is non-watertight reconstruction (e.g., cloth simulation with open boundaries), the watertight bias may be a mismatch.
- If you need dynamic or deformable training data (e.g., moving characters with skeletal animations), this static set won’t cover motion.
- If your domain is highly specialized (e.g., medical CT-based anatomy), general consumer/product shapes may not transfer well.
Open questions:
- How to extend from static to dynamic 3D with consistent part semantics over time?
- How to enrich material/PBR labels and physics properties for simulation-heavy domains?
- How to standardize universal part taxonomies across categories for stronger compositional generation?
- How to scale synthetic pipelines while guaranteeing photoreal geometry-texture alignment and reducing any style drift?
- What are the best unified metrics beyond ULIP/Uni3D to capture geometry fidelity, material accuracy, and part-level correctness together?
06 Conclusion & Future Work
Three-sentence summary: HY3D-Bench delivers a unified, open-source ecosystem for 3D generation by cleaning 252k assets into training-ready, watertight meshes with multi-view renderings, adding 240k part-level decompositions, and synthesizing 125k long-tail assets. It also sets fair, fixed benchmarks with standard splits, metrics, and baselines. Training a smaller Hunyuan3D-2.1-Small on this data achieves near state-of-the-art alignment scores, proving that great data can rival sheer model size.
Main achievement: Turning messy, inconsistent 3D data into a complete, structured, and diverse resource that researchers can use immediately—no heavy preprocessing—while providing a shared, trustworthy scoreboard for progress.
Future directions: Expand to dynamic and articulated assets with temporal consistency, enrich material/physics annotations, grow semantic part taxonomies, and keep scaling the synthetic pipeline to close remaining long-tail gaps. Explore unified metrics that jointly measure geometry, appearance, and part correctness.
Why remember this: HY3D-Bench shows that in 3D AI, quality data engineering plus structure and diversity can be as transformative as new model tricks. It lowers the barrier to entry, speeds up innovation, and points the field toward fairer, more reproducible science—so more people can build better 3D worlds, faster.
Practical Applications
- Train a 3D VAE or diffusion model directly on the provided watertight meshes and multi-view images without custom preprocessing.
- Build a part-aware editor where users swap, scale, or recolor individual components (e.g., chair legs) using part meshes and masks.
- Create robotics grasping datasets by focusing on functional parts (handles, knobs) and testing across rare, synthetic categories.
- Prototype an image-to-3D pipeline using the benchmark’s standard splits and report ULIP/Uni3D scores for fair comparison.
- Generate balanced datasets for simulation by mixing real and synthetic assets to cover long-tail industrial tools or medical carts.
- Teach 3D concepts in classrooms with clean examples: render views, inspect parts, and print watertight meshes on school 3D printers.
- Benchmark new methods (e.g., improved watertightening or sampling) by plugging into the fixed evaluation suite.
- Design AR/VR assets quickly: start from part-level templates, customize appearances, and export sealed meshes for engines.
- Develop part-aware generative models (e.g., “add drawers to this cabinet”) using synchronized RGB+part-ID training pairs.
- Test domain transfer: fine-tune on synthetic subsets to improve performance on rare categories in your target application.