Stroke3D: Lifting 2D strokes into rigged 3D models via latent diffusion models
Key Summary
- Stroke3D lets you draw simple 2D stick-figure strokes plus a short text, and it builds a ready-to-animate 3D model with a skeleton and textures.
- It splits the job into two stages: first it creates a controllable 3D skeleton from your strokes; then it grows a textured mesh around that skeleton.
- The skeleton stage uses a graph VAE and a diffusion transformer (Sk-VAE + Sk-DiT) so structure stays clean and matches your strokes and words.
- For the mesh stage, the authors build a new dataset (TextuRig) of textured, rigged models with captions and use it to train a skeleton-to-mesh generator.
- They further align meshes to skeletons with SKA-DPO, a preference-learning trick guided by a Skeleton Alignment (SKA) score.
- Compared to strong baselines, Stroke3D achieves lower Chamfer Distances (better skeleton accuracy) and higher SKA scores (better mesh-skeleton match).
- The system is robust to small drawing mistakes and different viewpoints, and it can handle edits like adding wings or moving joints.
- Outputs can be skinned automatically in Blender and animated without falling apart, enabling quick AR/VR, game, and film prototyping.
- Limitations include dataset bias, sparse coverage of rare poses, and quality that depends on stroke completeness; future work targets bigger, more varied datasets and end-to-end text-to-rigged-3D generation.
- This is the first framework to lift user-drawn 2D strokes directly into a rigged, textured, animatable 3D model.
Why This Research Matters
Stroke3D takes a simple sketch and turns it into a moving 3D character, shrinking the gap between imagination and animation. This empowers students, indie creators, designers, and researchers who don’t have years of 3D rigging experience. Faster prototyping means more ideas get tested in games, AR/VR, and films, which can speed up creative industries. By aligning meshes tightly to skeletons, it reduces frustrating cleanup and produces animations that hold together better. The dataset curation (TextuRig) and preference optimization (SKA-DPO) also set useful patterns for future 3D generation research. Overall, it pushes 3D creation toward being as easy and playful as drawing, but with professional-grade results.
Detailed Explanation
01 Background & Problem Definition
You know how stop-motion animators put little armatures (skeletons) inside puppets so they can bend arms and legs smoothly? 3D artists do the same thing: they put a hidden skeleton inside a surface (the mesh) so characters can move. But making those rigged 3D assets takes lots of software wizardry and time.
🍞 Hook: Imagine you could sketch a stick figure and type “a bird is flying,” and minutes later have a 3D bird you can animate. That’s the dream.
🥬 The Concept (Mesh Rigging): Rigging means attaching a virtual skeleton to a 3D surface so it can bend naturally.
- How it works: 1) Build bones and joints; 2) Bind (skin) the mesh to those bones; 3) Move joints to pose/animate; 4) The mesh follows.
- Why it matters: Without rigging, you get a stiff statue that can’t act, run, or fly. 🍞 Anchor: In games, when a character raises an arm, rigging ensures the sleeve and elbow bend correctly.
The world before: Recent 3D generators could make pretty shapes, but often like statues—they didn’t come with skeletons, so animating them was hard. Other tools could guess skeletons after the mesh was made, but control was fuzzy: bones might appear in weird places or miss where you truly needed them. Also, training good mesh-to-skeleton systems needs many examples with both meshes and high-quality skeletons, and those are rare—especially with good textures.
The problem: Two big roadblocks stopped easy animation-ready creation.
- Animatable geometry is hard. Many text-to-3D pipelines produce static outputs without a usable skeleton.
- Limited structural control. Even advanced auto-rigging often guesses bone layouts, leading to extra or missing bones and unpredictable rigs.
Failed attempts: Mesh-first pipelines tried to add skeletons later. But when the mesh is already baked, the tool has to “guess” a skeleton that fits, which can be brittle. Autoregressive skeleton-from-mesh methods got better with big datasets, but still lacked an easy way for users to say, “Put a joint here!”
The gap: What if we flipped the script—let people “draw the bones first” so the computer knows the structure from the start, then build the mesh around it? And what if the system could read both the drawing and a short text like “a dinosaur ready to pounce” to nail the pose and identity?
Real stakes: This matters for more than cool demos.
- Indie game devs and students could prototype animated characters in hours, not weeks.
- AR/VR creators could quickly test interactive avatars.
- Film previz teams could iterate poses and body plans without heavy modeling.
- Educators could teach animation principles with simple sketches.
- Robotics simulators could get articulated models aligned to intended joint layouts.
To solve this, Stroke3D introduces a skeleton-first, two-stage pipeline that listens to your strokes (for structure) and your text (for meaning), then delivers a rigged, textured, ready-to-animate 3D asset.
02 Core Idea
🍞 Hook: You know how building a house starts with the frame (beams) before adding the walls and paint? If you try to paint first, it makes no sense. 3D characters are similar: start with the skeleton (frame), then add the body (mesh) and colors (textures).
Aha! Moment (one sentence): Stroke3D makes animatable 3D by first generating a controllable 3D skeleton from your 2D strokes and text, then growing a textured mesh that aligns tightly to that skeleton.
Three analogies:
- Blueprint-first: Draw the stick-figure blueprint (your strokes), then the system builds the 3D house with sturdy beams and nice walls.
- Puppet-making: Bendable wires (skeleton) are shaped exactly as you sketch, then the puppet’s body (mesh) is wrapped on, and finally dressed (textures).
- Baking: You set the cake mold (skeleton shape) with your strokes, pour in batter (mesh), and decorate (textures), so the final cake matches the mold.
Before vs After:
- Before: Mesh-first systems guessed skeletons later, causing wobbly control and mismatches.
- After: Skeleton-first ensures bones land exactly where you draw, and the mesh is trained to hug that skeleton.
Why it works (intuition): Separating “where the moving parts go” (skeleton) from “what it looks like” (mesh + textures) keeps each task simple and controllable. The strokes remove structural ambiguity (how many limbs? where are joints?), while text locks in identity and pose. Operating in a compressed “latent” space smooths learning, like designing with Lego blocks instead of raw clay.
Building blocks (introduced in dependency order, each with a sandwich explanation):
🍞 Hook: Imagine sculpting with invisible foam blocks that make hard tasks easier. 🥬 The Concept (Latent Diffusion Model): It’s a generator that creates new things by first adding noise in a hidden space and then learning to remove it step by step.
- How it works: 1) Move your data into a compact latent space; 2) Add noise to scramble it; 3) Train a model to denoise; 4) Start from noise and denoise to create new samples.
- Why it matters: Working in latent space is faster and more stable than raw pixels or raw coordinates. 🍞 Anchor: Instead of painting every pixel, the model paints rough shapes in a small room, then brings them back to full detail.
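The "add noise" half of that recipe is simple enough to sketch in a few lines. Below is a toy NumPy version of the forward process over a small latent vector; the schedule values and latent are illustrative assumptions, and the learned denoiser that runs the process in reverse at generation time is omitted:

```python
import numpy as np

# Toy forward diffusion over a latent vector: blend a clean latent with noise
# according to a signal-keeping schedule. (The linear schedule and random
# latent here are illustrative assumptions, not the paper's settings.)
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4,))               # a "clean" latent code
T = 10
alpha_bar = np.linspace(0.99, 0.01, T)   # fraction of signal kept at step t

def q_sample(z0, t, eps):
    """Noisy latent: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps

eps = rng.normal(size=z0.shape)
z_early = q_sample(z0, 0, eps)       # mostly signal, lightly scrambled
z_late = q_sample(z0, T - 1, eps)    # mostly noise
print(z_early.shape, z_late.shape)   # (4,) (4,)
```

Training teaches a network to predict and remove the added noise; generation then starts from pure noise and applies that network step by step.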
🍞 Hook: You know how a map shows cities (nodes) and roads (edges)? 🥬 The Concept (Skeletal Graph VAE, Sk-VAE): A VAE that encodes a skeleton graph (joints + bones) into a compact code and decodes it back.
- How it works: 1) Read 3D joint positions and connections; 2) Compress into a latent vector; 3) Later, decode back into joint positions given the same graph edges.
- Why it matters: Gives a smooth “language” for skeletons so diffusion can generate valid, realistic bone layouts. 🍞 Anchor: It learns typical distances like “elbow near shoulder” so joints don’t end up in goofy places.
🍞 Hook: Think of a conductor who listens to a melody (text) while guiding a choir (graph nodes) to sing together. 🥬 The Concept (Skeletal Graph Diffusion Transformer, Sk-DiT): A diffusion transformer that generates skeleton latents on graphs, guided by text and your stroke structure.
- How it works: 1) Start with noisy latent; 2) Use graph-aware attention to pass messages only along bones; 3) Cross-attend to text for semantics; 4) Condition on your 2D strokes for structure; 5) Denoise to a clean skeleton code.
- Why it matters: Combines meaning (text) and shape (strokes) so the 3D skeleton matches what you want. 🍞 Anchor: You draw a long tail and type “dinosaur”; Sk-DiT yields a dinosaur-like skeleton with that tail.
🍞 Hook: A cookbook with many pictures helps you cook better meals. 🥬 The Concept (TextuRig Dataset): A curated set of textured, rigged 3D models with captions.
- How it works: 1) Start from Objaverse-XL rigged subset; 2) Reprocess to keep textures; 3) Filter for quality; 4) Auto-caption with a vision–language model.
- Why it matters: Training needs good examples; TextuRig teaches how textures and meshes should look around skeletons. 🍞 Anchor: If the caption says “husky dog with big ears,” the trained model learns ear shapes and fur patterns.
🍞 Hook: Like a food taster choosing the better dish. 🥬 The Concept (Skeletal Alignment Score, SKA score): A score that measures how well the mesh lines up with the given skeleton across views.
- How it works: 1) Generate multiple mesh candidates; 2) Score each for skeleton–mesh match; 3) Prefer the higher-scoring one.
- Why it matters: Keeps elbows near elbow bones, knees near knee bones, etc. 🍞 Anchor: If two dogs are generated, the one whose legs fit the leg bones wins.
🍞 Hook: A coach watches pairs of attempts and says, “This one is better—do more like this.” 🥬 The Concept (Direct Preference Optimization, DPO → SKA-DPO here): A training method that makes the model favor “winner” samples over “losers” based on preferences.
- How it works: 1) Make pairs of candidates; 2) Use SKA score to pick winner vs loser; 3) Train so the model moves toward winners.
- Why it matters: Meshes become more structurally faithful to skeletons over time. 🍞 Anchor: Over training, shoulders align better to shoulder joints because winners had that property.
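The winner-vs-loser objective can be sketched as a tiny function. This is a generic DPO loss in NumPy, not the paper's exact implementation: the log-probabilities, reference model, and beta value are stand-in assumptions, and in SKA-DPO the winner/loser labels would come from the SKA score:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one winner/loser pair.
    Pushes the policy to raise the winner's likelihood (relative to a frozen
    reference model) more than the loser's. beta is an assumed strength."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# When the policy already prefers the winner, the loss is small...
low = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
# ...and larger when it prefers the loser instead.
high = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(low < high)  # True
```

Minimizing this loss over many SKA-scored pairs is what gradually tightens mesh-to-skeleton alignment.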
Bottom line: Skeleton-first design + stroke conditioning + text semantics + preference alignment = intuitive control, better skeletons, and meshes that actually fit their bones.
03 Methodology
At a high level: Input (2D strokes + text) → Stage 1: Skeleton generation (Sk-VAE encode/Sk-DiT generate/Sk-VAE decode) → Stage 2: Mesh synthesis (train on TextuRig, then refine with SKA-DPO) → Output: Textured, animatable 3D model.
Step 0: Data preparation
- What happens: The team builds a good training set and tools. They: (a) align and caption skeleton–mesh pairs using a vision–language model; (b) curate TextuRig (textured, rigged, captioned); (c) provide a canvas where users click joints and draw connections as 2D strokes.
- Why it exists: Models need clean, well-labeled examples, and the canvas makes stroke inputs topologically match 3D skeletons.
- Example: Render front/side/top views of a model and ask a VLM to describe it: “A wolf is standing calmly.”
🍞 Hook: Like a storyteller who sees pictures and writes captions. 🥬 The Concept (Vision–Language Model, VLM): An AI that understands images and words together.
- How it works: 1) Show it multi-view renders; 2) It reads shapes and pose; 3) Writes a helpful caption.
- Why it matters: Captions give text guidance so later models know “what” (a wolf) and “how” (standing calmly). 🍞 Anchor: The label “bird flying” teaches the pose and species together.
Step 1: Sk-VAE learns skeleton latents
- What happens: Treat a skeleton as a graph: nodes are joints (with XYZ), edges are bones. The Sk-VAE encoder compresses this into a latent vector; the decoder learns to reconstruct joints given the edges.
- Why it exists: Puts skeletons into a smooth space where similar structures are close, making generation stable.
- Example: A human T-pose and a human A-pose land near each other in latent space.
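The encode/sample/decode round trip at the heart of a VAE can be illustrated with random stand-in weights. The real Sk-VAE uses trained graph networks that also read bone edges; everything below, including the latent size, is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal VAE-style round trip over skeleton joints: encode to a latent,
# sample with the reparameterization trick, decode back to joint coordinates.
n_joints, latent_dim = 6, 4
joints = rng.normal(size=(n_joints * 3,))          # flattened XYZ per joint

# Random linear maps as stand-ins for the trained encoder/decoder networks.
W_enc = rng.normal(size=(latent_dim, joints.size)) * 0.1
W_dec = rng.normal(size=(joints.size, latent_dim)) * 0.1

mu = W_enc @ joints                                 # latent mean
logvar = np.full(latent_dim, -2.0)                  # assumed latent variance
z = mu + np.exp(0.5 * logvar) * rng.normal(size=latent_dim)  # reparameterize
recon = W_dec @ z                                   # decode to joint coords
print(z.shape, recon.shape)  # (4,) (18,)
```

Training would pull `recon` toward the original joints while keeping the latent distribution smooth, which is what makes nearby codes decode to similar skeletons.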
Step 2: Sk-DiT generates skeletons from strokes + text
- What happens: During training, they simulate strokes by projecting 3D skeletons to 2D and adding small jitters (like hand drawing). At inference, your real 2D strokes (joint XY + edges) are embedded and concatenated with noisy latents. Sk-DiT denoises step by step, guided by text.
- Why it exists: Text alone can’t fix structure (how many legs?), and strokes alone can’t add identity (is it a bird or a dino?). Combining both resolves ambiguity.
- Example: You draw a long spine and two legs and type “raptor ready to pounce.” The model places joints accordingly in 3D.
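The stroke-simulation trick described above, projecting a 3D skeleton to 2D and jittering it like a shaky hand, can be sketched as follows (the orthographic front view and noise scale are assumptions; the paper does not specify these exact values):

```python
import numpy as np

def simulate_strokes(joints_3d, jitter=0.02, seed=None):
    """Fake a hand-drawn 2D stroke input from a 3D skeleton: orthographic
    front-view projection (keep x, y; drop depth) plus small Gaussian jitter
    that mimics imprecise drawing. Edges/bones carry over unchanged."""
    rng = np.random.default_rng(seed)
    joints_2d = joints_3d[:, :2]
    return joints_2d + rng.normal(0.0, jitter, joints_2d.shape)

skeleton = np.array([[0.0, 1.0, 0.2],   # head joint (x, y, z)
                     [0.0, 0.5, 0.2],   # spine joint
                     [0.3, 0.0, 0.2]])  # leg joint
strokes = simulate_strokes(skeleton, seed=42)
print(strokes.shape)  # (3, 2)
```

Training on these jittered projections is what makes the model forgiving of real, slightly wobbly user strokes at inference time.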
Inside Sk-DiT (attention and conditioning):
🍞 Hook: When you study, you don’t treat every word equally—you focus on the important ones. 🥬 The Concept (Self-Attention): A way for a model to decide which parts of the input to focus on when processing itself.
- How it works: 1) Look at all tokens/nodes; 2) Score importance between pairs; 3) Mix information based on scores.
- Why it matters: Lets joints talk to their neighbors to keep a consistent skeleton. 🍞 Anchor: Elbow pays more attention to shoulder and wrist than to tail.
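A minimal scaled dot-product self-attention, stripped of the learned projections a real transformer adds, looks like this. The optional mask hints at how graph-aware attention could restrict messages to bone-connected joints; this is an illustrative sketch, not Sk-DiT's actual layer:

```python
import numpy as np

def self_attention(x, mask=None):
    """Scaled dot-product self-attention over node features x: (n, d).
    If mask (n, n boolean) is given, a node may only attend where mask is
    True, e.g. along skeleton bones."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise importance
    if mask is not None:
        scores = np.where(mask, scores, -1e9)     # block non-edges
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax per node
    return w @ x                                  # mix neighbor features

joints = np.random.default_rng(0).normal(size=(5, 8))  # 5 joints, 8-dim each
out = self_attention(joints)
print(out.shape)  # (5, 8)
```

With an all-diagonal mask each joint can only attend to itself and the layer returns its input unchanged; a bone-adjacency mask instead lets an elbow mix in its shoulder and wrist but not the tail.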
🍞 Hook: It’s like matching subtitles to a movie—you line up language with the right scene. 🥬 The Concept (Cross-Attention): Connects one stream (e.g., text) to another (e.g., skeleton nodes) so they inform each other.
- How it works: 1) Turn text into embeddings; 2) Each joint looks up relevant words; 3) Blend in semantic hints.
- Why it matters: Ensures the same bones form very different creatures depending on the words. 🍞 Anchor: The same four limbs become hooves for a “horse” but paws for a “dog,” guided by text.
Step 3: Decode to a 3D skeleton
- What happens: After Sk-DiT predicts a clean latent, the Sk-VAE decoder outputs the final 3D joint coordinates consistent with the stroke-defined edges.
- Why it exists: Guarantees valid, realistic skeletons that respect the drawn topology.
- Example: If you connected a wing bone to the spine, the decoder preserves that connection in 3D.
Step 4: Mesh synthesis trained with TextuRig
- What happens: Using TextuRig’s paired skeleton–mesh–texture–caption data, the skeleton-conditioned mesh generator (based on SKDream/MVDream) learns to produce multi-view images that reconstruct into a textured mesh fitting the skeleton.
- Why it exists: Good textured, rigged examples are rare; TextuRig plugs the gap so meshes look right and are texture-consistent.
- Example: “Husky with big ears” produces a fluffy head and large ears attached correctly to skull joints.
🍞 Hook: Decorating a plain cake with icing and sprinkles makes it look real and tasty. 🥬 The Concept (Texture Synthesis): Creating surface colors and patterns that make the 3D object look realistic.
- How it works: 1) Generate multi-view images; 2) Reconstruct geometry; 3) Project or optimize textures; 4) Ensure views agree.
- Why it matters: Without texture, outputs look like gray clay. 🍞 Anchor: Bark patterns on a tree stump or feather colors on a bird’s wing.
Step 5: Preference fine-tuning with SKA-DPO
- What happens: For each skeleton + caption, generate two mesh candidates. Score them with the SKA score (skeleton–mesh alignment). Use DPO to make the model prefer the better one.
- Why it exists: Tightens the mesh around its bones so animation is stable.
- Example: If one candidate puts the knee mesh too far from the knee joint, it loses; future generations learn to keep knees aligned.
Secret sauce:
- Skeleton-first pipeline removes structure guessing.
- Stroke conditioning puts users in the driver’s seat.
- Latent graph diffusion + cross-attention blends structure and semantics smoothly.
- TextuRig supplies the missing texture-rich, rigged data.
- SKA-DPO steadily pushes generations toward better skeleton–mesh fit.
Finally, auto-skinning tools (e.g., in Blender) bind the produced mesh to the generated skeleton for animation with minimal manual cleanup.
04 Experiments & Results
The test: The team checks two things: (1) How close the generated skeleton is to the ground truth; and (2) How well the final mesh aligns to the skeleton across views.
🍞 Hook: Comparing shapes is like checking if two puzzle pieces match. 🥬 The Concept (Chamfer Distance): A way to measure how far two point sets are from each other.
- How it works: 1) For every point in set A, find the nearest in set B; 2) Average the distances; 3) Do the reverse; 4) Add them.
- Why it matters: Smaller is better—joints and bones land where they should. 🍞 Anchor: Measuring elbow-to-elbow and knee-to-knee gaps between predicted and true skeletons.
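Chamfer Distance is short enough to write out exactly as described. This NumPy sketch uses the mean-of-nearest-neighbor-distances variant; some papers use squared distances instead, so treat the exact formula as an assumption:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a: (N, 3) and b: (M, 3).
    For each point in a, find its nearest neighbor in b; average those
    distances; repeat in the other direction; add the two averages."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairs
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # predicted joints
gt   = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]])  # ground-truth joints
print(chamfer_distance(pred, gt))  # ~0.1: one joint is off by 0.1 each way
```

Identical point sets score exactly zero, which is why "smaller is better" here.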
🍞 Hook: A tailor checks if the suit fits the mannequin’s shoulders and knees. 🥬 The Concept (SKA score): A score telling how well mesh geometry lines up with the skeleton across views.
- How it works: 1) Render views; 2) Extract features; 3) Compute alignment between bone positions and mesh parts; 4) Average across views.
- Why it matters: Good alignment means natural bending during animation. 🍞 Anchor: The mesh shoulder pad stays near the shoulder joint instead of floating away.
The competition: Stroke3D is compared with strong baselines in two arenas.
- Skeleton generation: RigNet, SKDream, MagicArticulate, UniRig.
- Skeleton-to-mesh: SDEdit and SKDream (using their evaluation protocol).
Scoreboard with context:
- Skeleton accuracy: Stroke3D achieves the lowest Chamfer Distances across most categories (characters, animals, plants). Think of it as getting an A+ while others hover around B to B+—it places joints and bones more precisely and consistently.
- Mesh alignment: Adding TextuRig to the training boosts SKDream-style mesh generation notably, and SKA-DPO pushes it further. The final Mean Instance SKA score reaches about 87.8–87.9, a clear jump over the baseline—like moving from a solid B+ to an A.
Surprising findings:
- Structural conditioning (using your 2D strokes) speeds up learning and improves stability—without it, convergence is slower and less reliable.
- The system is robust to small stroke mistakes or missing joints (dropping a few joints still yields good skeletons), and it is fairly view-invariant (side or top views can still work if enough structure is visible).
- A modest preference margin (around 0.10) in SKA-DPO gives the best boost, balancing signal strength and noise.
Qualitative highlights:
- When generating multi-view images for mesh synthesis, Stroke3D’s views fit the input skeleton better than the plain baseline, reducing artifacts like warped limbs or collapsing torsos.
- After auto-skinning in Blender, animations stay stable—no dramatic mesh tearing around elbows or knees—indicating good skeleton–mesh binding.
Overall: The experiments show the skeleton-first design, plus TextuRig and SKA-DPO, deliver both cleaner skeletons and more faithful, animation-ready meshes than prior methods.
05 Discussion & Limitations
Limitations:
- Data coverage: Some rare categories or unusual poses aren’t well represented, so outputs can be weaker there.
- Stroke ambiguity: If strokes hide important joints (e.g., extreme side views with overlap), the model has to guess, which can reduce accuracy.
- Texture variance: While TextuRig helps a lot, textures of very niche subjects may still lack detail.
- Dependency on auto-skinning: Final rig quality can still depend on skinning tools/settings.
Required resources:
- Training needs GPUs (e.g., A100 class) and curated datasets (MagicArticulate, TextuRig).
- Inference is lighter: you need the canvas tool for strokes and a CLIP-like text encoder, plus the trained Sk-VAE/Sk-DiT and mesh module.
When not to use:
- If you need exact, production-ready anatomical rigs with studio-grade constraints (e.g., film hero assets), heavy manual rigging may still be preferred.
- If you lack clear structural guidance (scribbles without connected joints), the skeleton may not follow your intent.
- If you need rare, highly specific textures (e.g., a particular historical armor pattern) without references, results may be generic.
Open questions:
- How far can stroke conditioning go for very complex topologies (e.g., multi-segment wings + accessories)?
- Can we integrate pose diversity and rare categories by scaling TextuRig further or using synthetic data augmentation?
- Could an end-to-end text-to-rigged-3D model retain control while simplifying the pipeline?
- How to better interpret/stabilize cross-attention so text cues always alter the right joints?
- Can preference signals expand beyond SKA (e.g., animation smoothness or physical plausibility)?
06 Conclusion & Future Work
Three-sentence summary: Stroke3D turns your 2D strokes and a short text into a rigged, textured 3D model by first generating a controllable skeleton and then building a mesh that tightly fits it. It trains a skeletal graph VAE and a diffusion transformer to honor user-drawn structure and text meaning, and uses a new TextuRig dataset plus SKA-DPO preference learning to improve mesh–skeleton alignment. The result is an intuitive, animation-ready workflow that outperforms prior methods on both skeleton accuracy and final mesh fidelity.
Main achievement: Pioneering a skeleton-first, stroke-conditioned pipeline that gives everyday users precise control over structure while delivering high-quality, animatable 3D assets.
Future directions:
- Scale to more poses and rarer categories with larger, richer datasets and augmentations.
- Move toward an end-to-end text-to-rigged-3D system without sacrificing control.
- Expand preference signals (e.g., motion stability, self-collision, physical plausibility) to further polish results.
Why remember this: Stroke3D is the first to directly lift 2D strokes into rigged 3D, proving that blending simple user sketches with modern diffusion and preference optimization can make high-quality animation assets accessible to everyone—from students to indie creators to studios.
Practical Applications
- Rapid prototyping of game characters by sketching poses and structures before level testing.
- AR/VR avatar creation from simple stroke drawings and short descriptions for social apps.
- Education tools where students learn rigging concepts by drawing joint layouts and seeing instant 3D feedback.
- Pre-visualization in film and advertising to explore character silhouettes, poses, and motion quickly.
- Robotics simulation models with explicit joint placement for testing articulated motion plans.
- Character kitbashing: add wings, tails, or extra limbs by drawing new strokes to edit structure.
- Creature design: iterate on anatomy (e.g., longer neck, digitigrade legs) while ensuring animatable rigs.
- Medical or sports visualization prototypes that illustrate joint motion intuitively (non-diagnostic).
- Indie creators producing VTuber-style avatars with controllable joints and expressive poses.
- Research testbeds for evaluating skeleton–mesh alignment and new preference-learning signals.