MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Key Summary
- MorphAny3D is a training-free way to smoothly change one 3D object into another, even if they are totally different (like a bee into a biplane).
- It uses a special 3D representation called SLAT and blends information inside attention layers instead of just mixing noises or images.
- Morphing Cross-Attention (MCA) keeps shapes and parts believable by separately attending to the source and the target before combining them.
- Temporal-Fused Self-Attention (TFSA) makes motion smooth by letting each frame look back at the previous frame's features.
- An orientation correction step prevents sudden spins by snapping poses to the best-matching rotation relative to the last frame.
- Across 50 test pairs, MorphAny3D achieved the best plausibility (lowest FID) and aesthetics, and nearly the best smoothness (PPL), compared to strong baselines.
- The method works without retraining, runs on a single GPU (about 30 seconds per frame), and generalizes to other SLAT-based 3D generators.
- It also enables cool extras like morphing only structure or only detail, mixing two targets, and 3D style transfer.
- Limitations mainly come from the base generator (Trellis), especially on ultra-fine details, but stronger future backbones may fix this.
- Overall, the key idea is smart, structured feature fusion inside attention, balancing believable shapes with smooth, time-consistent motion.
Why This Research Matters
MorphAny3D lets artists and developers create smooth, believable 3D transformations across wildly different objects without retraining large AI models. This lowers costs and speeds up production for films, games, advertising, and AR/VR. Educators and scientists can visualize complex changes over time (like growth, assembly, or evolution) in clear, engaging ways. Product designers can preview transitions between prototypes or styles, improving creativity and communication. Because it runs on off-the-shelf generators and a single GPU, small teams can access high-end effects once limited to big studios.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine watching a cartoon where a teapot slowly becomes a turtle: no sudden jumps, just a smooth, magical change. That smooth in-betweening is called morphing.
The Concept (3D Morphing): 3D morphing is turning one 3D object into another through a sequence of believable, smooth steps. How it works (in simple terms):
- Pick a start object (source) and an end object (target).
- Generate many tiny steps between them.
- Make sure each step looks like a reasonable blend, and the whole sequence feels smooth over time. Why it matters: Without good morphing, transitions look jerky or nonsensical, breaking the illusion in movies, games, and AR/VR. Anchor: Turning a chair into a car for a movie scene: good morphing keeps parts evolving logically (legs to wheels) and the motion steady.
The World Before: AI became great at making single 2D images look amazing, thanks to diffusion models. But moving from 2D to 3D morphing is hard: you need believable shape changes and textures that evolve together, and you must keep everything smooth as the camera and object move in space. Traditional 3D morphing mostly tried to match points between the two shapes (like connecting elbows to fenders) and then interpolate. This works within one category (cat-to-cat), but breaks down across different categories (cat-to-car), where there are no clear part matches.
Hook: You know how puzzles are easier when pieces obviously fit? Matching-based 3D morphing often tried to force-fit pieces from two totally different puzzles.
The Concept (Shape Correspondence): Many older 3D methods rely on finding which part of object A matches which part of object B. How it works: Estimate correspondences, then linearly blend positions and sometimes textures. Why it breaks: Across categories, correspondences can be wrong or missing (what's the elbow of a teapot?), causing twisted or implausible shapes. Anchor: Trying to morph a bee's wing directly to a biplane's propeller by force-matching pixels often yields broken geometry.
People also tried: (1) Do 2D morphing first, then lift each 2D frame into 3D. This often led to inconsistencies between frames because each 3D object was rebuilt independently, like redrawing a character from scratch every time. (2) Directly interpolate the generator's noise or conditions (a popular trick in image morphing). But in 3D, that offered no guarantee that shapes stay plausible or that motion is smooth.
The Gap: We lacked a way to use modern 3D generators intelligently: one that could fuse source and target information to get both (a) semantically plausible shapes and (b) temporally smooth motion, without retraining big models.
Hook: Imagine you have two expert teachers: one knows the source object and one knows the target. If you just average their voices, you get gibberish. But if you listen to each clearly and then combine their advice, you get wisdom.
The Concept (Structured Latent, SLAT): SLAT is a neat, grid-like way to store a 3D object's structure and local details so that a model can attend to them. How it works: First, a Sparse Structure stage finds where important points live (the skeleton of the shape). Then a SLAT stage fills in local latent codes that describe geometry and texture at those points. Why it matters: SLAT is organized and explicit, so you can carefully blend or reference features instead of doing messy, unsafe mixing. Anchor: Think of LEGO bricks (local latents) snapped onto a simple frame (sparse structure). Organized pieces make controlled transformations easier.
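To make the "frame plus bricks" picture concrete, here is a minimal sketch of what a SLAT-like container could look like in code. This is illustrative PyTorch only; the field names, shapes, and grid size are assumptions, not Trellis's actual data structures:

```python
from dataclasses import dataclass

import torch

@dataclass
class SparseLatent:
    """Illustrative SLAT-like container: a sparse "frame" plus local "bricks"."""
    coords: torch.Tensor  # (N, 3) integer voxel positions: the sparse structure
    feats: torch.Tensor   # (N, C) local latent codes describing geometry/texture

# Example: 1,000 active voxels on a 64^3 grid, each carrying an 8-channel latent.
slat = SparseLatent(
    coords=torch.randint(0, 64, (1000, 3)),
    feats=torch.randn(1000, 8),
)
```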
Real Stakes: Smooth, believable 3D morphing powers film transitions, game effects, educational visualizations (e.g., seed-to-tree growth), product design previews, and AR filters. Making this work across wildly different categories, without retraining huge models, saves time and money and opens up creative freedom.
02 Core Idea
Hook: You know how when you bake a marble cake, you swirl chocolate and vanilla without fully mixing them, so each flavor stays tasty and the pattern looks beautiful?
The Concept (Key Insight): Instead of averaging everything, MorphAny3D separately attends to the source and the target inside the generator and then fuses the results, while also letting each new frame look back at the previous frame and fixing sudden spins with a simple orientation check. How it works (the recipe):
- Start from SLAT features for source and target.
- In cross-attention, compute attention to source and to target separately, then blend the two outputs by the morphing weight; this is Morphing Cross-Attention (MCA).
- In self-attention, let the current frame borrow keys/values from the previous frame (this is Temporal-Fused Self-Attention, TFSA) to keep motion smooth.
- After the structure is estimated, test a few yaw rotations and pick the one closest to the last frame to avoid sudden flips (orientation correction). Why it matters: Naively mixing features can scramble semantics and cause jitter. Smart, structured fusion preserves believable parts and ensures steady transitions. Anchor: Morphing a crab into a camera: MCA lets the lens evolve from the crab's shell (plausible parts), TFSA keeps the shell-to-lens change smooth frame-to-frame, and orientation correction avoids random spins.
Three Analogies:
- Two Coaches, One Player: Instead of averaging two coaches' voices mid-sentence, the player listens to each coach separately (MCA), then makes a balanced move; the player also watches last game's replay before this move (TFSA) to keep consistency.
- Layered Tracing: Trace the source outline and the target outline on different sheets (separate attentions), then gently crossfade the traces; keep yesterday's tracing as a guide to avoid wobble (TFSA).
- GPS with Memory: Use directions from origin and destination separately, then combine (MCA); also look at your last known good path (TFSA); if the map suggests a sudden U-turn at mid-journey, double-check orientation (orientation correction).
Before vs After:
- Before: Direct interpolation inside the generator made fuzzy, sometimes implausible shapes and jittery sequences. 2D-first pipelines looked good per frame but lacked 3D temporal consistency.
- After: MCA preserves semantic correctness (parts evolve logically), TFSA enforces temporal smoothness, and orientation correction prevents mid-sequence flips, altogether yielding smooth, believable 3D morphs without retraining.
Why It Works (intuition):
- Attention is an importance spotlight. If you blend the spotlight itself (keys/values) before aiming it, you point at the wrong places. MCA aims two clean spotlights first (source and target), then mixes their lit results. That keeps semantics intact.
- Motion smoothness comes from short-term memory. TFSA borrows from the previous frame's features (already plausible) rather than re-mixing raw source/target, so the path is steady.
- Orientation glitches follow model pose biases. A tiny, cheap rotation check uses shape similarity to snap poses back in line.
Building Blocks (sandwich style for each):
- Hook: Think of an organized toolbox. SLAT (Structured Latent): A tidy, spatially-anchored set of codes for structure and detail. How: First find important spatial anchors (Sparse Structure), then attach local detail codes (SLAT) to them. Why: Orderly storage makes controlled, safe fusion possible. Anchor: A pegboard with labeled tools: easy to find, easy to swap.
- Hook: Two teachers giving you separate notes. MCA (Morphing Cross-Attention): Compute attention to source and to target separately, then blend outputs by the morph weight. How: For the same queries, run cross-attention twice (source-only, target-only), then linearly combine results. Why: Prevents semantically mismatched features from being mixed too early. Anchor: For a bee→biplane morph, wings attend to source wings and target propellers separately, then combine, avoiding background confusion.
- Hook: Draw a flipbook by looking at the last page. TFSA (Temporal-Fused Self-Attention): Mix current-frame self-attention with a small portion from the previous frame's keys/values. How: Compute attention to current latents and to the last frame's latents, then blend with a small weight (e.g., 0.2). Why: Reduces jitter so shapes evolve steadily. Anchor: The crab's claw stays stable as it turns into a camera grip, not wobbling each frame.
- Hook: Straightening a tilted photo before hanging it. Orientation Correction: Try a few yaw rotations and pick the one closest to the last frame's structure. How: Rotate candidate structures (0°, 90°, 180°, 270°), choose the one with the smallest distance to the previous frame's structure. Why: Stops sudden spins caused by pose biases in the generator. Anchor: Midway through bee→plane, the model wants to flip; the check snaps it back to match the last frame's heading.
03 Methodology
At a high level: Input (source, target) → Prepare SLAT latents and conditions → Frame n initial mix (slerp by α) → Sparse Structure stage with MCA (+ orientation correction) → SLAT stage with MCA + TFSA → Decode to mesh/3DGS/NeRF → Output frame → Repeat for n = 0…N.
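The per-frame loop can be summarized in a short structural sketch. Every stage callable here (slerp, the two sampling stages, orientation correction, decoding) is a hypothetical stand-in for the components described in the steps below, not a Trellis API:

```python
from typing import Any, Callable, List, Optional

def morph_sequence(
    src: Any,
    tgt: Any,
    num_frames: int,
    slerp: Callable,
    sample_ss: Callable,            # Sparse Structure stage with MCA
    correct_orientation: Callable,  # yaw snap against the previous frame
    sample_slat: Callable,          # SLAT stage with MCA + TFSA
    decode: Callable,               # SLAT -> mesh / NeRF / 3DGS
) -> List[Any]:
    frames: List[Any] = []
    prev_ss: Optional[Any] = None
    prev_slat: Optional[Any] = None
    for n in range(num_frames + 1):
        alpha = n / num_frames                    # morph weight: 0 = source, 1 = target
        z0 = slerp(src, tgt, alpha)               # alpha-weighted initialization
        ss = sample_ss(z0, alpha)                 # plausible structure for this frame
        ss = correct_orientation(ss, prev_ss)     # compare against last frame's structure
        slat = sample_slat(ss, alpha, prev_slat)  # TFSA borrows last frame's keys/values
        frames.append(decode(slat))               # concrete 3D asset for frame n
        prev_ss, prev_slat = ss, slat             # short-term memory for the next frame
    return frames
```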
Step 1: Inputs and Setup
- What happens: You start with a source object and a target object. If they are real 3D assets, you invert them into the generator's latent space to get their initial features (for both the Sparse Structure and SLAT stages) and image conditions. If they were generated by the same system earlier, you can reuse cached features.
- Why this step exists: The generator (Trellis) works in its own feature space. Getting both objects into that space makes them comparable and morphable.
- Example: For bee→biplane, we extract or reuse the bee's and biplane's latents and their guiding image features (e.g., DINOv2 features).
Hook: Like picking the first and last frames of a flipbook. Concept (α-weighted initialization): Initialize each frame's starting point by smoothly mixing the source and target noisy latents with spherical interpolation (slerp) using α from 0 to 1. How: α=0 gives the source; α=1 gives the target; in between, you get a gradual blend. Why: This gives every frame a reasonable starting guess without sharp jumps. Anchor: At α=0.3, the bee still dominates; at α=0.7, the plane dominates.
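A minimal slerp sketch in PyTorch (tensor shapes are illustrative; the real latents live in the generator's own space):

```python
import torch

def slerp(z_src: torch.Tensor, z_tgt: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation: alpha=0 returns the source latent, alpha=1 the
    target; in between, the blend stays on the arc between them, so the
    latent's norm doesn't collapse the way plain linear mixing would."""
    a, b = z_src.flatten(), z_tgt.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps)
    omega = torch.acos(cos)   # angle between the two latents
    s = torch.sin(omega)
    out = (torch.sin((1 - alpha) * omega) / s) * a + (torch.sin(alpha * omega) / s) * b
    return out.reshape(z_src.shape)

# At alpha=0.3 the result leans toward the source (the bee still dominates).
z_mid = slerp(torch.randn(16, 8, 8, 8), torch.randn(16, 8, 8, 8), alpha=0.3)
```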
Step 2: Sparse Structure (SS) Stage with MCA
- What happens: The SS stage predicts where important voxels (anchors) are in 3D space for frame n. In cross-attention layers, we apply MCA: compute attention to source-only and to target-only conditions separately, then blend their outputs by α.
- Why this step exists: Structure must stay plausible as parts move (e.g., bee head to cockpit). Early, clean attention avoids mismatched features.
- Example: For the bee's thorax evolving toward a fuselage, MCA makes the SS stage place anchors that still make sense for both objects.
Hook: Two separate spotlights, then mix the lit results. Concept (MCA in SS): Run cross-attention twice (source-only, target-only), then combine: output = (1−α)·source_attn + α·target_attn. How: Keep queries the same, but prevent premature mixing of keys/values. Why: Avoids the common failure where mixed keys/values point to background or wrong parts. Anchor: The model focuses on bee wings when evolving wings, and on biplane propeller regions when evolving the front, with no accidental attention to the sky.
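A minimal MCA sketch in plain PyTorch; the shapes and conditioning-token count are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def morphing_cross_attention(q, k_src, v_src, k_tgt, v_tgt, alpha: float):
    """Run cross-attention twice with the SAME queries -- once against
    source-only keys/values, once against target-only -- and blend the two
    OUTPUTS by the morph weight. Keys/values are never mixed before attention."""
    attn_src = F.scaled_dot_product_attention(q, k_src, v_src)  # source spotlight
    attn_tgt = F.scaled_dot_product_attention(q, k_tgt, v_tgt)  # target spotlight
    return (1 - alpha) * attn_src + alpha * attn_tgt            # blend the lit results

# Toy shapes: (batch, heads, query_tokens, dim) and (batch, heads, cond_tokens, dim).
q = torch.randn(1, 4, 128, 64)
k_s, v_s = torch.randn(1, 4, 77, 64), torch.randn(1, 4, 77, 64)
k_t, v_t = torch.randn(1, 4, 77, 64), torch.randn(1, 4, 77, 64)
out = morphing_cross_attention(q, k_s, v_s, k_t, v_t, alpha=0.3)
```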
Step 3: Orientation Correction (after SS)
- What happens: After predicting the sparse structure for frame n, we create four candidates by yaw-rotating the structure by 0°, 90°, 180°, and 270°. We choose the one most similar (lowest Chamfer Distance) to the previous frame's structure and pass this corrected structure forward.
- Why this step exists: Generators sometimes snap to biased poses (like sudden 180° flips) around mid-morphs. This step corrects that cheaply and reliably.
- Example: At α≈0.5, the model tries to spin the morph. The 0° yaw candidate matches the previous frame best, so we keep it and avoid the flip.
Hook: Straighten the picture frame before adding details. Concept (Orientation Correction): Try a few discrete yaw rotations and keep the one closest to the last frame. How: Measure structure-to-structure distance; choose the smallest. Why: Prevents jarring spins without hurting normal frames. Anchor: The crab-camera morph stays facing forward instead of randomly spinning sideways.
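A sketch of the orientation check, assuming the structure is a float point set centered on the vertical axis; the Chamfer Distance is the criterion named above:

```python
import math
from typing import Optional

import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def correct_orientation(coords: torch.Tensor, prev: Optional[torch.Tensor]) -> torch.Tensor:
    """Yaw-rotate the candidate structure by 0/90/180/270 degrees and keep the
    rotation whose structure is closest to the previous frame's."""
    if prev is None:  # the first frame has nothing to compare against
        return coords
    best, best_d = coords, float("inf")
    for deg in (0, 90, 180, 270):
        c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
        rot = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]], dtype=coords.dtype)
        cand = coords @ rot.T  # rotate about the vertical (y) axis
        d = chamfer(cand, prev).item()
        if d < best_d:
            best, best_d = cand, d
    return best
```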
Step 4: SLAT Stage with MCA + TFSA
- What happens: Now we fill in local detail codes (geometry and texture) at the anchored voxels. We again use MCA in cross-attention to combine source/target semantic guidance cleanly. Additionally, in self-attention, we use TFSA: blend current-frame attention with a small portion (e.g., 20%) of the previous frame's keys/values.
- Why this step exists: MCA keeps details semantically believable; TFSA shares stable information across time, reducing flicker and jitter.
- Example: The bee's stripes fade into the biplane's paintwork steadily; shell-like surfaces evolve into metal panels without frame-to-frame popping.
Hook: Draw today's page of the flipbook while peeking at yesterday's. Concept (TFSA): Output = (1−β)·Attn(current K,V) + β·Attn(prev K,V), with small β. How: Keep using current queries; carefully fuse in previous-frame memory. Why: Encourages continuity across frames using already-plausible neighbors, not raw source/target mixes. Anchor: The evolving cockpit windows don't shimmer differently every frame; they smoothly appear and refine.
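A matching TFSA sketch (same illustrative shapes as the MCA example; β = 0.2 follows the small weight quoted above):

```python
import torch
import torch.nn.functional as F

def temporal_fused_self_attention(q, k_cur, v_cur, k_prev, v_prev, beta: float = 0.2):
    """Current-frame queries attend separately to current-frame and
    previous-frame keys/values; the outputs are blended so each frame inherits
    a little short-term memory from its predecessor."""
    attn_cur = F.scaled_dot_product_attention(q, k_cur, v_cur)     # this frame
    attn_prev = F.scaled_dot_product_attention(q, k_prev, v_prev)  # last frame's memory
    return (1 - beta) * attn_cur + beta * attn_prev

# The very first frame has no predecessor, so it falls back to plain
# self-attention (effectively beta = 0) there.
q = torch.randn(1, 4, 256, 64)
k_c, v_c = torch.randn(1, 4, 256, 64), torch.randn(1, 4, 256, 64)
k_p, v_p = torch.randn(1, 4, 256, 64), torch.randn(1, 4, 256, 64)
out = temporal_fused_self_attention(q, k_c, v_c, k_p, v_p)
```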
Step 5: Decoding to 3D and Rendering
- What happens: Decode the completed SLAT into standard 3D forms (e.g., mesh, NeRF, or 3D Gaussian Splatting) and render views (RGB and normals) for evaluation or visualization.
- Why this step exists: You need a concrete 3D asset for viewing, editing, or export to tools like Blender or Unreal.
- Example: Export the bee→biplane sequence as meshes and render a turntable video (see the sketch below).
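If the decoder yields meshes, exporting them is straightforward; here is a sketch using trimesh, where the per-frame (vertices, faces) format is a hypothetical decoder output, not the actual Trellis interface:

```python
import numpy as np
import trimesh

# Hypothetical decoder output: one (vertices, faces) pair per frame.
meshes = [(np.random.rand(100, 3), np.random.randint(0, 100, (50, 3)))]

for i, (verts, faces) in enumerate(meshes):
    trimesh.Trimesh(vertices=verts, faces=faces).export(f"morph_{i:03d}.obj")
```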
Secret Sauce (why this pipeline is clever):
- It doesn't retrain the big model; it uses the model's own structure (SLAT + attention) in a careful way.
- MCA fixes a subtle but major problem: mixing keys/values too early scrambles attention. Separate, then blend.
- TFSA adds a memory that's local in time and trustworthy, curing jitter without sacrificing semantics.
- A tiny, robust pose snap (orientation correction) solves big mid-sequence flips observed in real data.
What breaks without each step:
- No α-initialization: frames may start too far apart and wander.
- No MCA: local artifacts and wrong part focus (e.g., background-attended blobs) appear.
- No TFSA: shape/detail pop and jitter frame-to-frame.
- No orientation correction: mid-morph spins and sudden turns jar the viewer.
- No decoding: you can't use the results in real pipelines.
Concrete Mini Example (bee→biplane around α=0.5):
- SS with MCA sets anchors that hint both at bee body and plane fuselage.
- Orientation correction rejects a 180° flip and keeps heading steady.
- SLAT with MCA grows front details toward a propeller while preserving the bee's center-of-mass logic.
- TFSA references the stable features of the α=0.49 frame so panels and stripes don't flicker.
- The decoded mesh shows a believable hybrid mid-form, ready for the next step.
04 Experiments & Results
The Test: Researchers evaluated plausibility (how realistic and semantically correct frames look), smoothness (how steady changes are across adjacent frames), and overall visual appeal. They used 50 diverse source-target pairs, mixing real 3D assets and generator-made ones, and produced 50-frame sequences per pair.
Hook: Like grading a dance routine: are the moves clean (plausible), is the flow smooth (no stumbles), and is it pleasing to watch? Concepts (Metrics):
- Plausibility (FID): Lower is better; think of it as how close the generated frames feel to high-quality references.
- Smoothness (PPL, PDV): Lower is better; they measure average perceptual change step-to-step and how even that change is.
- Aesthetics (AS) and User Preference (UP): Do humans and vision-language models find it appealing and realistic? Why they matter: Together they capture both what each frame looks like and how well frames connect as a sequence. Anchor: An 87% score alone means little; context shows if that's amazing or average.
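For intuition, here is a sketch of how PPL/PDV-style smoothness scores could be computed, assuming they are the mean and variance of LPIPS distances between adjacent rendered frames (a plausible reading of the metric names, not necessarily the paper's exact protocol):

```python
import torch
import lpips  # pip install lpips

def smoothness_metrics(frames: torch.Tensor):
    """frames: (T, 3, H, W) rendered views scaled to [-1, 1].
    Returns (PPL-like mean step change, PDV-like variance of step changes)."""
    perceptual = lpips.LPIPS(net="alex")
    with torch.no_grad():
        steps = torch.stack([
            perceptual(frames[i : i + 1], frames[i + 1 : i + 2]).squeeze()
            for i in range(len(frames) - 1)
        ])
    return steps.mean().item(), steps.var().item()
```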
The Competition: Baselines covered multiple families: classic 3D matching and interpolation (3DInterp, SLATInterp, MorphFlow), 2D morphing lifted to 3D (DiffMorpher, FreeMorph), and direct interpolation of generator features (DirectInterp).
Scoreboard with Context:
- MorphAny3D achieved FID 111.95 (best; lower is better), meaning its frames looked the most plausible overall; that's like getting an A+ while others were around B to C.
- Smoothness (PPL): MorphAny3D scored 2.47, just behind the very best 2.41, so it's practically neck-and-neck for smoothness while still being the most plausible.
- PDV (variance of step changes): 0.0006 (best), showing highly even transitions.
- Aesthetics and User Preference: MorphAny3D ranked highest, indicating people consistently preferred its look and feel.
Head-to-Head Takeaways:
- Matching-based 3D methods were smooth (linear interpolation tends to be), but plausibility was poor when categories differed, like trying to morph a kettle into a kangaroo by force-matching points.
- 2D-first pathways produced strong single frames but stumbled on temporal coherence once lifted to 3D, causing jitter.
- Direct interpolation inside the generator mixed features in unhealthy ways (no structural or temporal guardrails), leading to middling results.
- MorphAny3D balanced both worlds: top plausibility plus near-best smoothness, thanks to MCA + TFSA + orientation correction.
Surprising/Insightful Findings:
- Orientation jumps clustered around mid-morph (α≈0.5) and were mostly yaw jumps of 90°, 180°, or 270°. This revealed a pose bias in the base generator. The lightweight orientation correction neatly solved it in practice.
- A naive approach that fuses keys/values in both cross- and self-attention at once hurt plausibility, confirming that careful, staged fusion (MCA and TFSA) is crucial.
- Ablations show each component helps: MCA drops FID substantially; adding TFSA improves PPL/PDV; orientation correction further reduces residual jitter.
Generalization:
- The approach worked not only on Image-to-3D Trellis but also generalized to other SLAT-based models (e.g., Hi3DGen, Text-to-3D Trellis), producing smooth, high-quality morphs without retraining.
Concrete Number Highlights (from the paper):
- FID: MorphAny3D 111.95 vs best baseline FreeMorph 164.68 (lower is better).
- PPL: MorphAny3D 2.47 vs best 2.41 (MorphFlow); MorphAny3D is very close while being much more plausible.
- Ablations (FID / PPL / PDV): KV-Fused CA 125.47 / 3.82 / 0.0013; MCA 112.18 / 3.66 / 0.0010; MCA + TFSA 113.22 / 2.87 / 0.0007; MCA + TFSA + Orientation Correction 111.95 / 2.47 / 0.0006 (best overall balance).
05 Discussion & Limitations
Limitations:
- Ultra-fine geometric details can still show artifacts because the method inherits limitations of the base 3D generator (Trellis). Very thin structures (like spider silk or intricate filigree) may be imperfect.
- Extreme cross-category cases with no intuitive semantic bridges can still produce odd intermediates (e.g., sponge→satellite) where no natural part-mapping exists.
- The discrete orientation correction only tests a few yaw angles (0°, 90°, 180°, 270°). Rare flips around pitch/roll or odd angles might slip by.
Required Resources:
- No retraining is needed, which is a huge win. Running image-to-3D Trellis with MorphAny3D took about 30 seconds per frame on a single NVIDIA A6000 (≈24 GB). Memory and runtime scale with output resolution and model size.
When NOT to Use:
- If you require CAD-level precision or watertight meshes for engineering, current generators (and thus MorphAny3D) may not meet tolerances.
- If style consistency over many minutes of animation is critical, with strict lighting/material continuity (e.g., tightly constrained VFX pipelines), you may prefer fully supervised or retrained systems.
- If your morph demands dramatic, artistic spins and flips, the orientation correction might fight that intent (though it can be disabled).
Open Questions:
- Can we make orientation correction continuous and learn to predict optimal rotation per frame, not just pick among a few discrete candidates?
- Could stronger 3D backbones (or improved SLAT designs) eliminate most fine-detail artifacts while keeping speed and training-free use?
- Can we add explicit texture-evolution controls (e.g., preserve color palettes or materials) while still benefiting from MCA/TFSA?
- How do we best support multi-target, multi-stage morphs (e.g., bee→leaf→plane) with consistent temporal memory and semantic control?
06 Conclusion & Future Work
Three-Sentence Summary: MorphAny3D is a training-free method for 3D morphing that smartly fuses structured latent features inside attention, keeping shapes believable and motion smooth. It introduces Morphing Cross-Attention (separate, then blend) and Temporal-Fused Self-Attention (look back to the previous frame), plus a simple orientation correction that stops mid-sequence flips. Across many category pairs, it achieves state-of-the-art plausibility and near-best smoothness and generalizes to other SLAT-based generators.
Main Achievement: Showing that careful, structured fusion inside attentionārather than naive interpolationāunlocks high-quality, temporally coherent 3D morphing across categories without any retraining.
Future Directions: Upgrade the backbone generator for finer details; make orientation correction continuous and more general; add user controls for structure-only or detail-only transitions; extend to longer narratives (multi-target chains) and scene-level morphs.
Why Remember This: It reframes 3D morphing as a problem of how and where to fuse information in a structured latent space, proving that the right attention design and a pinch of temporal memory can turn complex cross-category transformations into smooth, cinematic sequences with zero extra training.
Practical Applications
- Film and TV transitions where one prop or creature morphs into another smoothly and believably.
- Game VFX that gradually transform items (e.g., weapon upgrades) or characters (e.g., power-up evolutions).
- AR/VR filters that create dynamic object transformations in real time for interactive experiences.
- Education demos showing natural processes (seed→sapling→tree) as intuitive 3D morph sequences.
- Product design previews that morph between prototypes or materials to aid decision-making.
- 3D style transfer for assets: keep structure but morph textures and fine details to new artistic styles.
- Disentangled morphing: change global shape while preserving local details (or vice versa) for controlled edits.
- Dual-target morphing: blend structure from one object and detail from another to craft novel hybrids.
- Previsualization in advertising: smoothly morph product models to demonstrate features or variants.
- Rapid concept art in 3D: explore design spaces by morphing between reference objects without retraining.