
Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

Intermediate
Jia-peng Zhang, Cheng-Feng Pu, Meng-Hao Guo et al. · 2/4/2026
arXiv · PDF

Key Summary

  • Rigging 3D characters is a bottleneck: making bones and skin weights by hand is slow and tricky, and past automatic tools often guess the skin weights poorly.
  • This paper turns skinning weights into tiny Lego-like codes called SkinTokens so the computer predicts a few smart tokens instead of millions of numbers.
  • A special encoder-decoder (FSQ-CVAE) learns these tokens by compressing sparse skin weights while keeping important details for animation.
  • TokenRig is a single generator that writes the whole rig as a sequence: first the skeleton tokens, then the SkinTokens, so bones and skin are learned together.
  • They fine-tune TokenRig with reinforcement learning (GRPO) using geometry-based rewards that encourage bones to sit inside the mesh and deformations to stay smooth.
  • Across standard datasets, SkinTokens improve skinning accuracy by about 98%–133% over strong baselines, and RL boosts bone prediction by 17%–22%.
  • The method stays robust on messy, out-of-distribution models (like creatures with wings or tails) that earlier systems often mishandle.
  • SkinTokens are highly compact (over 180× compression), enabling fast, scalable pipelines for games, film, VTubing, and generative 3D.
  • This unified approach reduces error propagation between separate stages and produces cleaner, artifact-free deformations in motion.

Why This Research Matters

High-quality auto-rigging shortens production cycles for games, film, and AR/VR by turning raw 3D assets into animation-ready characters quickly. A tokenized skinning representation means less memory, faster inference, and better stability on messy, real-world meshes. Unified generation of bones and skin prevents mismatch errors that cause ugly deformations, raising final animation quality. Reinforcement learning rewards inject practical rigging rules so the system handles unusual anatomy like wings or tails without manual fixes. As generative 3D models proliferate, this approach scales rigging to meet demand while keeping artists focused on creative polish rather than repetitive setup.

Detailed Explanation


01Background & Problem Definition

Imagine you’re building a puppet. You add sticks inside (bones) so it bends at the right places, and you glue the cloth outside (skin) so it moves smoothly. For digital characters, this is called rigging: we place a skeleton (joints and bones) inside the 3D mesh and assign skinning weights that tell each vertex how much to follow each bone. For years, artists have had to do this by hand, and even with tools, it’s slow and specialized. Meanwhile, 3D generation models exploded, creating tons of characters that still need rigs to animate. The result: a traffic jam. So many models, not enough rigs.

The world before: Automatic rigging focused first on skeletons. Early geometric tricks could carve out a rough skeleton inside a clean, watertight mesh, but they didn’t understand semantics (they couldn’t tell an ear from a horn). Later, deep models started to predict skeletons more robustly—even turning skeletons into tokens to feed Transformers—so skeletons got better. But the second half, skinning weights, lagged behind. Most methods treated it as a giant regression problem: directly predicting an N × J matrix (N vertices, J joints). That’s millions of numbers, mostly zeros, because each vertex really listens to just a few nearby bones. Training models on this huge, mostly-zero matrix is like trying to learn a song from an hour of silence sprinkled with a few notes—it’s inefficient and unstable.

The problem: When skinning is learned separately from skeletons, the two never teach each other. The skeleton is built without thinking how the skin will stretch; the skin is computed for a fixed skeleton even if that skeleton could be better. The decoupling caps overall quality. Worse, meshes in the real world are messy: non-watertight, disconnected parts (clothes, belts, hair), or unusual anatomy (wings, tails). Methods that depend on fragile geometric features (like geodesic distances) often break on such meshes, causing weight “bleeding” across gaps or bones that poke outside the body.

Failed attempts: Direct regression with common losses like MSE or BCE tends to average things out (mean-seeking) or under-emphasize the important non-zero regions due to extreme sparsity. Some tried clever geometric descriptors to guide weights, but those fail on messy meshes. Others focused only on skeletons and left skinning to a second model, which reintroduces the same mismatch.

The gap: We needed a smarter way to represent skin weights—something compact, discrete, and semantically meaningful—so that a single model could learn both the bones and the skin together, and also be refined by how the character actually deforms in motion.

Real stakes: If you’re a game developer, animator, VTuber, or in AR/VR, rigging speed and quality decide whether your characters ship this month or next. In education or indie projects, better auto-rigging unlocks animation for everyone. In big studios, automating 80% and letting artists polish the last 20% can save huge time and reduce repetitive work.

🍞 Top Bread (Hook): You know how it’s easier to build with Lego bricks than to sculpt clay from scratch? Bricks snap together and keep structure.

🥬 The Concept (Auto-regressive Models): What it is: An auto-regressive model makes a sequence by predicting one piece at a time, using what it already wrote to decide the next piece. How it works: (1) Start with a special start token. (2) Predict the next token. (3) Add it to the sequence. (4) Repeat until done. Why it matters: Without this step-by-step plan, the model would try to decide everything at once and miss long-range patterns, like how the left arm relates to the right leg in a pose.

🍞 Bottom Bread (Anchor): Writing a sentence: the model chooses one word, then uses that to pick the next, so the whole sentence stays coherent.
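To make that four-step recipe concrete, here is a minimal Python sketch of a generic autoregressive loop. The `next_token_fn` callback and the token IDs are illustrative placeholders, not the paper's model or vocabulary.

```python
def autoregressive_generate(next_token_fn, bos=0, eos=1, max_len=100):
    """Generate a sequence one token at a time, feeding the growing
    prefix back into the predictor at every step."""
    seq = [bos]                      # (1) start with a special start token
    while len(seq) < max_len:
        tok = next_token_fn(seq)     # (2) predict the next token from the prefix
        seq.append(tok)              # (3) add it to the sequence
        if tok == eos:               # (4) repeat until the end token appears
            break
    return seq
```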

02Core Idea

The “Aha!” moment: If skinning weights are mostly zeros with small important islands, don’t regress the whole ocean—learn a tiny set of smart tokens that describe those islands, and generate the skeleton and these tokens together as one story.

Multiple analogies:

  • Music compression: Instead of saving every sound sample, we save notes and instruments. SkinTokens are the notes; the decoder plays them back as smooth skinning.
  • Recipe cards: Rather than listing every grain of sugar, you keep a short card with key steps. SkinTokens are the short card for complex weight patterns.
  • Subway map: A few colored lines explain the whole network. SkinTokens trace influence lines from bones to surface regions.

Before vs After:

  • Before: Two models, two problems. First predict a skeleton, then separately regress a huge weight matrix. Errors don’t inform each other. Training struggles with sparsity and messy geometry.
  • After: One model, one sequence. It writes skeleton tokens first, then SkinTokens that describe each bone’s influence. The model “knows” how bones and skin interact, and can be further tuned with motion-aware rewards.

Why it works (intuition): The skinning matrix is intrinsically sparse and structured: vertices near a bone share similar influence, and each vertex is influenced by a small set of bones. A learned tokenization captures these repeatable local patterns in a compact form. Once weights become discrete tokens, the whole rig becomes a language the Transformer can write coherently, bone by bone, skin by skin. And because it’s generative, we can nudge it with reinforcement learning using geometry and motion checks to handle weird, out-of-distribution characters.

Building blocks:
  • Learn SkinTokens with a conditional VAE and Finite Scalar Quantization (FSQ) so weights become short discrete codes.
  • Build TokenRig, a single auto-regressive Transformer that outputs a unified sequence: skeleton tokens followed by SkinTokens.
  • Add RL (GRPO) with four rewards—joint coverage, bone containment, skin coverage/sparsity, deformation smoothness—so generated rigs are valid and animate well beyond the training set.

🍞 Top Bread (Hook): Imagine sorting your closet into labeled boxes instead of leaving clothes scattered everywhere; finding outfits becomes easy. 🥬 The Concept (SkinTokens): What it is: SkinTokens are compact, discrete codes that describe a bone’s skinning influence pattern over the mesh. How it works: (1) Look at the mesh and its true skin weights. (2) Encode the important, sparse weight regions into a short latent vector. (3) Quantize it into a few discrete tokens. (4) Decode tokens back to smooth weights. Why it matters: Without SkinTokens, the model must predict millions of tiny numbers; with them, it only predicts a handful of precise codes that are easier to learn and reuse. 🍞 Bottom Bread (Anchor): For a forearm bone, a few tokens can capture “weights hug the forearm cylinder, taper near the wrist,” instead of listing values for every vertex.

🍞 Top Bread (Hook): You know how you write a story from start to finish so that the plot makes sense? 🥬 The Concept (TokenRig): What it is: TokenRig is a single generator that writes one sequence containing both the skeleton and the SkinTokens, conditioned on the mesh shape. How it works: (1) Encode the mesh into a global shape embedding. (2) Autoregressively output tokens: first all joints and bones, then for each bone, its SkinTokens. (3) Decode SkinTokens back to skinning weights. Why it matters: Without one shared sequence, bones and skin are predicted in isolation; TokenRig learns their cause-and-effect together, reducing mismatches and artifacts. 🍞 Bottom Bread (Anchor): When you ask it to rig a dragon, it writes tokens for the spine and tail bones first and then writes SkinTokens that wrap those bones with the right influence along the tail.

🍞 Top Bread (Hook): Think of turning a long paragraph into a few keyword tags that still capture the meaning. 🥬 The Concept (FSQ-CVAE): What it is: A Conditional Variational Autoencoder that compresses skin weights into a small vector and then quantizes each number to a fixed set of levels (FSQ), producing discrete tokens. How it works: (1) Two encoders read the mesh and weights. (2) The skin latent is created and passed through fixed-level quantizers (no fragile learned codebook). (3) A decoder reconstructs weights, trained with losses that focus on the sparse, important areas. Why it matters: Without FSQ-CVAE, tokens wouldn’t capture clean, reusable patterns, and training would overfit to zeros. 🍞 Bottom Bread (Anchor): A hand’s finger weights compress into a few tokens that, when decoded, paint smooth influence along each finger.

🍞 Top Bread (Hook): Imagine getting points for good moves and learning to avoid bad ones in a game. 🥬 The Concept (Reinforcement Learning): What it is: A training step where the model tries rig variations and gets rewards for good geometry and motion, improving over time. How it works: (1) Generate several rig candidates. (2) Score them with rewards: bones inside the mesh, joints covering all parts, sparse-but-complete skinning, smooth deformations. (3) Update the model toward higher-reward sequences. Why it matters: Without RL, the model may stick to average solutions and miss unusual parts (like wings or tails). 🍞 Bottom Bread (Anchor): On a creature with horns, the RL rewards encourage adding horn bones and keeping them inside the mesh, so the deformation looks natural.
🍞 Top Bread (Hook): You know how a chef tastes a dish after each ingredient to decide the next step? 🥬 The Concept (Auto-regressive Models): What it is: A model that predicts the next token using all previous tokens, building a coherent sequence. How it works: (1) Start token. (2) Predict next. (3) Append and repeat. (4) Stop at end token. Why it matters: Without step-by-step prediction, long structures like skeletons and skins lose consistency. 🍞 Bottom Bread (Anchor): The model writes a spine joint, then shoulders, then arms, each step informed by what it already placed.

03Methodology

At a high level: Input mesh → Learn SkinTokens with FSQ-CVAE → Build unified token sequence (skeleton then SkinTokens) → Autoregressively generate rigs (TokenRig) → Refine via RL with geometry/motion rewards → Output final skeleton and skin weights.

Step 1: Learn SkinTokens (FSQ-CVAE)

  • What happens: The system learns to turn a bone’s sparse skin weights into a short sequence of discrete tokens. Two encoders read the shape and the per-bone weight map; a decoder learns to reconstruct the weights from a compact latent that gets quantized into tokens using Finite Scalar Quantization (FSQ). Training uses a loss that emphasizes non-zero regions (Dice) plus BCE/MSE for stability. Nested dropout teaches the model to do well with fewer tokens, and importance sampling focuses the decoder on active deformation zones.
  • Why this exists: Directly regressing N × J weights is ill-posed and sparsity-blind. Compressing into tokens gives a small, learnable vocabulary of weight patterns that are easy to predict and reuse.
  • Example: For a 6,000-vertex character and 30 bones, the raw FP16 skinning weights would take 180,000 numbers (6,000 × 30). With SkinTokens (e.g., 32 tokens per bone), the storage drops over 180× while keeping shape-aware details.

How FSQ-CVAE works practically:

  • Encoders: A shape encoder processes mesh points (XYZ and normals). A skin encoder processes the per-bone weight distribution. Both follow a set-based design that doesn’t care about vertex order. Outputs form a continuous latent for skin weights.
  • FSQ quantization: Each latent dimension snaps to the nearest level on a fixed grid (no learned codebook). Straight-Through Estimator passes gradients during training, making it simple and stable (a code sketch follows this list).
  • Losses for sparse data: BCE models weights in [0,1], Dice boosts gradients where weights are non-zero (so the model pays attention to the important islands), and a small MSE term stabilizes regression around boundaries.
  • Efficiency helpers: Nested dropout randomly shortens the token sequence during training so the decoder learns to do more with less. Importance sampling shows the decoder extra points from active regions to speed learning.
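Here is a minimal sketch of two ingredients named above: FSQ quantization with a straight-through estimator, and a soft Dice loss for sparse weight maps. Shapes and level counts are assumptions, not the paper's exact settings; the trailing comments work through the compression arithmetic behind the example's ~180× figure.

```python
import torch

def fsq_quantize(z, levels=8):
    """Finite Scalar Quantization: bound each latent dimension, then snap it
    to one of `levels` fixed grid points in [-1, 1]. No learned codebook."""
    z = torch.tanh(z)                                 # bound to (-1, 1)
    step = 2.0 / (levels - 1)
    z_q = torch.round((z + 1.0) / step) * step - 1.0  # nearest grid level
    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    return z + (z_q - z).detach()

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss over a per-bone weight map; it boosts gradients where
    weights are non-zero, countering the extreme sparsity of skinning data."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Compression arithmetic for the example above (illustrative assumptions):
#   raw weights : 6,000 vertices x 30 bones = 180,000 FP16 values
#   SkinTokens  : 30 bones x 32 tokens/bone = 960 discrete codes
#   ratio       : 180,000 / 960 ≈ 187x, consistent with "over 180x".
```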
Step 2: Build a unified sequence representation

  • What happens: We serialize the entire rig as tokens. First come skeleton tokens (quantized joint coordinates plus special type tokens), then each bone’s SkinTokens in a canonical order.
  • Why this exists: A single sequence lets the model learn long-range dependencies: when predicting the skin for the left arm, it can see the whole skeleton, including the right arm’s placement, torso size, and more.
  • Example: Sequence: <bos>, skeleton type, spine joint tokens, shoulder, arm, hand tokens, … then SkinTokens for bone 1, SkinTokens for bone 2, …, <eos> (this layout is sketched in code after the list).
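The layout can be written as a simple serializer. The helper and token IDs below are hypothetical stand-ins; only the ordering (skeleton first, then per-bone SkinTokens in a canonical bone order) follows the description above.

```python
BOS, EOS = 0, 1  # placeholder special-token IDs

def build_rig_sequence(skeleton_tokens, per_bone_skin_tokens):
    """Serialize a rig into one token sequence: skeleton tokens first
    (quantized joint coordinates plus type tokens), then each bone's
    SkinTokens in a canonical bone order."""
    seq = [BOS]
    seq.extend(skeleton_tokens)
    for skin_tokens in per_bone_skin_tokens:  # canonical bone order
        seq.extend(skin_tokens)
    seq.append(EOS)
    return seq
```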
Step 3: Autoregressive generation with TokenRig

  • What happens: Conditioned on a global shape embedding, a Transformer writes the full sequence. Self-attention allows any token to look at all previous tokens, so SkinTokens are predicted in full awareness of the previously written bones.
  • Why this exists: Step-by-step generation preserves structure and avoids the brittleness of separate modules.
  • Example: For a quadruped, the model writes spine and hips, then fore/hind legs, then SkinTokens that wrap weights neatly around limbs without spilling to the torso (a minimal decoding loop follows this list).
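A minimal greedy decoding loop for this stage, assuming a model that maps a (shape embedding, token prefix) pair to next-token logits. TokenRig's real interface and sampling strategy may differ; this is a sketch of the conditioned autoregressive pattern, not the paper's implementation.

```python
import torch

@torch.no_grad()
def generate_rig(model, shape_embedding, bos=0, eos=1, max_len=4096):
    """Greedy autoregressive decoding conditioned on a global shape embedding.
    Every new token attends to all previously written skeleton/skin tokens."""
    tokens = [bos]
    for _ in range(max_len):
        prefix = torch.tensor([tokens])          # shape (1, T)
        logits = model(shape_embedding, prefix)  # assumed: (1, T, vocab)
        next_id = int(logits[0, -1].argmax())    # greedy pick
        tokens.append(next_id)
        if next_id == eos:
            break
    return tokens
```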
Step 4: RL refinement with GRPO

  • What happens: After supervised training, the model samples multiple rig candidates per mesh. We compute four rewards and push the model toward better ones using Group Relative Policy Optimization (GRPO), which compares candidates within each group.
  • Why this exists: Supervised data rarely covers “weird” meshes (wings, tails, capes). RL injects geometric and motion common sense beyond the dataset.
  • Example rewards:
    • Volumetric Joint Coverage: Encourage joints to spread across all occupied volume so no body part is left without bones.
    • Bone-Mesh Containment: Penalize bones that protrude outside the mesh; keep bones inside.
    • Skinning Coverage and Sparsity: Each vertex should be influenced by at least one bone, but not by too many (typically ≤4).
    • Deformation Smoothness: Under random poses with Linear Blend Skinning, edges shouldn’t stretch wildly; encourage smooth motion.
  • GRPO details: For each mesh, generate a group of sequences, normalize rewards within the group (a built-in baseline), clip updates for stability, and add a KL penalty to stay close to the supervised policy. Invalid sequences get zero reward to discourage broken rigs (the normalization step and one reward are sketched after this list).
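Two pieces of this step as minimal sketches: group-relative advantage normalization and one of the four rewards (coverage/sparsity). The threshold, reward weights, and 50/50 blend are assumptions, not the paper's exact settings; clipping and the KL penalty are omitted here.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one mesh's group of sampled rigs; the group
    mean acts as a built-in baseline, so no value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def coverage_sparsity_reward(weights, max_influences=4, thresh=1e-3):
    """weights: (N, J) decoded skin weights. Reward rigs where every vertex
    is influenced by at least one bone, but by no more than `max_influences`."""
    active = weights > thresh
    covered = active.any(axis=1).mean()                     # >=1 bone/vertex
    sparse = (active.sum(axis=1) <= max_influences).mean()  # <=4 bones/vertex
    return 0.5 * covered + 0.5 * sparse
```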
Secret sauce

  • Turning a nasty, high-dimensional, sparse regression into a short token prediction problem (SkinTokens) unlocks powerful sequence models.
  • Unifying bones and skin in one sequence lets attention learn cross-effects that separate models miss.
  • RL with geometry/motion rewards closes the gap on out-of-distribution characters, especially with unusual anatomy or messy meshes.

Concrete walk-through example
  • Input: A stylized fox mesh with big tail and slender legs.
  • Stage 1: FSQ-CVAE learns finger/leg/tail weight patterns as short tokens.
  • Stage 2: TokenRig writes skeleton tokens: spine, hips, legs, tail chain.
  • Stage 3: For each tail bone, it writes SkinTokens encoding “soft, tapering influence along tail.”
  • Stage 4: RL rewards encourage full tail coverage (joint coverage), bones inside fur volume (containment), tidy sparse weights per vertex, and smooth tail bends in motion (the Linear Blend Skinning step these rewards evaluate is sketched below).
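Since the smoothness reward poses the character with Linear Blend Skinning, here is the standard LBS blend for reference: v' = Σ_j w_ij (R_j v_i + t_j). This is the textbook form with rest-pose handling and homogeneous transforms simplified away, not the paper's exact code.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """vertices: (N, 3), weights: (N, J), rotations: (J, 3, 3),
    translations: (J, 3). Each vertex moves as a weighted blend of the
    bone transforms that influence it."""
    # Transform every vertex by every bone: result shape (J, N, 3).
    per_bone = np.einsum('jab,nb->jna', rotations, vertices)
    per_bone += translations[:, None, :]
    # Blend the per-bone positions with the skinning weights: shape (N, 3).
    return np.einsum('nj,jna->na', weights, per_bone)
```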

04Experiments & Results

The test: Evaluate both skeleton and skinning quality. For skeletons, use Chamfer-style distances: Joint-to-Joint (J2J), Joint-to-Bone (J2B), Bone-to-Bone (B2B). For skinning, use L1 error (lower is better), Precision/Recall of active influence regions, and Motion Loss that measures deformation stretch under random poses. Also test compression quality and semantics of the learned tokens (minimal versions of two skinning metrics are sketched after the scoreboard).

The competition: Strong baselines include RigNet, MagicArticulate, UniRig, and Puppeteer—covering graph-based, diffusion/sequence, and unified skeleton pipelines, often with separate skinning regressors.

The scoreboard with context:

  • Skinning accuracy: TokenRig with SkinTokens reduces L1 error dramatically (e.g., 0.0163 vs. 0.0573 on ModelsResource), a leap of about 98%–133% over leading methods. Think of it as moving from blurry coloring outside the lines to crisp shading that stays inside each part.
  • Precision/Recall: Higher precision and recall mean the model puts influence in the right regions and covers all necessary areas—fewer “bleeding” artifacts to unrelated parts.
  • Motion Loss: Lower motion distortion indicates that skinning isn’t just numerically close; it animates well, with fewer stretchy or spiky edges when posing.
  • Skeleton accuracy: J2J and B2B improve across datasets (roughly a 17%–22% boost in bone prediction after RL), reflecting better joint placement and connectivity, especially on complex shapes.
  • Compression: SkinTokens achieve over 180× compression versus raw FP16 weights with high IoU overlap of active regions, proving that a few tokens really can capture the essential deformation patterns.
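For readers who want the metrics pinned down, here are minimal versions of the skinning L1 error and the active-region IoU. The activity threshold is an assumption; the benchmark's exact protocol may differ.

```python
import numpy as np

def skin_l1(pred, gt):
    """Mean absolute error over the full (N, J) skinning matrix; lower is better."""
    return float(np.abs(pred - gt).mean())

def active_region_iou(pred, gt, thresh=1e-3):
    """IoU of thresholded 'active influence' regions: how well decoded
    tokens recover where the ground-truth weights are non-zero."""
    p, g = pred > thresh, gt > thresh
    union = (p | g).sum()
    return float((p & g).sum() / union) if union else 1.0
```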
Surprising findings:

  • Few tokens go far: Even 4–6 SkinTokens per bone preserve most of the structure (strong IoU and low L1), showing that skinning patterns are highly compressible.
  • Semantic clusters: The continuous latents (before quantization) naturally group by body parts (head, hips, legs) across very different meshes, suggesting the model learns anatomy-like concepts.
  • RL helps especially on oddities: Wings, tails, horns, and clothing accessories often need extra bones or careful containment; RL rewards guide the model to add and place these well, even without explicit training examples.

Qualitative highlights:
  • Baselines often show weight “bleeding” across disconnected parts (like fingers bleeding into palm or sleeve). TokenRig’s decoded weights are crisp and local, especially on tricky fine structures like fingers and tails.
  • Skeletons from TokenRig are complete but not cluttered—fewer missing chains than graph/MST methods and fewer redundant, hard-to-animate joints than some autoregressive baselines.

Takeaway: Treating skin weights as tokens and unifying the sequence creates a step-change in both static metrics and animation quality. The model not only matches ground truth better but produces motion that looks cleaner and more natural.

05Discussion & Limitations

Limitations

  • Edge-case precision: In extremely complex deformations (e.g., layered clothing colliding with thin anatomy), a continuous-latent VAE can sometimes reconstruct subtle gradients slightly better than the quantized version.
  • Production constraints: Studios may require specific joint names, counts, or hierarchies. The current method is fully automatic; it needs an interface for user-specified templates or constraints.
  • Physics realism: Rewards focus on geometry and smoothness, not full physical plausibility. Without physics-based priors, fast motions could still produce unrealistic jiggle or squash in rare cases.

Required resources
  • Data: Paired meshes with skeletons and skin weights for supervised training, plus a set of challenging OOD meshes for RL.
  • Compute: Training a CVAE (hundreds of thousands of steps) and a Transformer policy, then running short RL fine-tuning. Inference is lightweight compared to training.
  • Tooling: Mesh preprocessing (normalization, voxelization for rewards), pose sampling for motion checks, and export to common DCC/engine formats.

When NOT to use
  • Non-skeletal deformations: If your character uses cloth simulation, soft-body physics, or blendshape-only rigs, this LBS-centric approach is not ideal.
  • Ultra-precise film shots where an artist needs handcrafted, shot-specific weight tweaks; the model can provide a strong starting point, but final polish might still be manual.

Open questions
  • Hybrid tokenization: Can continuous tokens or learned, adaptive quantization improve the last few percent without losing the benefits of discreteness?
  • Interactive co-piloting: How best to let artists steer topology, lock certain bones, or sketch influence zones, while the model fills in the rest?
  • Richer rewards: Can physics-based or task-specific rewards (e.g., foot skating penalties, contact preservation) further improve realism?
  • Generalizing across modalities: How well do SkinTokens transfer to different mesh resolutions, styles (toon vs. realistic), or rig conventions without re-training?

06Conclusion & Future Work

In three sentences: This paper reframes skinning from a giant, sparse regression into a compact token prediction problem using SkinTokens. With TokenRig, a single autoregressive model generates skeletons and SkinTokens together, then reinforcement learning refines results using geometry and motion rewards. The outcome is stronger accuracy, smoother animation, and better generalization to unusual meshes than prior methods.

Main achievement: Turning skinning into learned discrete tokens and unifying bones and skin into one generative sequence—then polishing with RL—produces a big jump in fidelity and robustness.

Future directions: Add interactive controls and template constraints for production, explore continuous or hybrid token representations for the hardest cases, and incorporate physics-aware rewards to further improve motion realism.

Why remember this: It’s a representation shift—once skinning becomes compact tokens, the whole rig becomes “a language” a model can write reliably, enabling fast, scalable, high-quality rigging for today’s flood of 3D assets.

Practical Applications

  • Batch auto-rigging of large character libraries for games and film pipelines.
  • Preparing VTuber avatars with cleaner deformations around fingers, hair, and clothing.
  • Speeding up prototyping in indie studios by generating a good rig baseline in minutes.
  • Auto-rigging creatures with unusual anatomy (wings, tails, horns) for fantasy projects.
  • Rigging community-made assets (often messy or non-watertight) with fewer artifacts.
  • Compressing and transmitting rigs efficiently for cloud or mobile animation apps.
  • Assisting education: students can study or edit generated rigs to learn good practices.
  • Pre-processing for motion retargeting: consistent skeletons and skin weights improve transfer.
  • Rapid iteration in 3D marketplaces: make assets animation-ready to increase adoption.
#auto-rigging#skinning weights#SkinTokens#TokenRig#autoregressive transformer#FSQ-CVAE#finite scalar quantization#reinforcement learning#GRPO#Dice loss#Linear Blend Skinning#skeleton generation#3D character animation#compression of rig data