DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
Key Summary
- DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
- It is trained only on synthetic (computer-made) images with perfect labels, yet it works great on real photos because it borrows common-sense visuals from a powerful diffusion model.
- A special multi-view mechanism makes all camera views agree with each other using epipolar attention, so the 3D shape is consistent from every angle.
- Tiny hand details are sharpened by a hand refinement step that zooms in on hands as extra views to capture finger poses more accurately.
- At test time, the system samples multiple guesses, measures how much they disagree per pixel, and trusts the most certain areas more, boosting robustness in hard scenes.
- Instead of juggling many noisy cues like keypoints and silhouettes, DiffProxy uses one uniform, dense correspondence signal that directly links pixels to the 3D body surface.
- Across five real-world benchmarks, it achieves state-of-the-art accuracy without training on any real image–mesh pairs, showing strong zero-shot generalization.
- It handles partial views, occlusions, and varied lighting, and can work even with estimated (not ground-truth) cameras with only moderate accuracy loss.
- More camera views generally improve results, and simple test-time scaling (more diffusion samples) further reduces errors.
- The main trade-off is speed (about two minutes per subject), and the current system focuses on one person at a time.
Why This Research Matters
DiffProxy shows that a model trained entirely on computer-made data can still perform excellently on real photos by leaning on a diffusion model’s real-world visual knowledge. That means lower data collection costs, fewer privacy concerns, and freedom from label biases that creep into real datasets. For everyday life, this unlocks more accurate VR avatars, telepresence, sports coaching, and medical or rehab tracking without needing expensive motion-capture studios. It can boost safety and training in factories by analyzing multi-camera footage for posture and ergonomics. It also provides a clean, dense supervision signal that is easier to debug and standardize, raising the reliability of research and products. Overall, it’s a step toward robust 3D understanding that’s affordable, scalable, and fairer.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how building a perfect LEGO figure from blurry photos is tough because the pictures can be unclear or mislabeled? If your instructions are wrong, your LEGO model keeps repeating the same mistakes.
🥬 The Concept (Human Mesh Recovery, HMR): HMR is teaching a computer to build a full 3D human shape from pictures.
- What it is: The computer recovers a 3D body (and hands, face) from one or more images.
- How it works: (1) Look at the person in each photo; (2) Guess a 3D pose and shape that could create those photos; (3) Adjust until the 3D person matches the images.
- Why it matters: Without HMR, virtual try-ons, rehab tracking, VR avatars, and sports analysis are much harder or less accurate. 🍞 Anchor: Think of a fitness app that watches you from several phones and builds your live 3D avatar to correct your squat.
🍞 Hook: Imagine your teacher graded your homework based on answers copied from a student who sometimes guesses. Your grades would reflect their biases, not the truth.
🥬 The Concept (Annotation Bias in Real Datasets): Real-world HMR labels are often created by fitting algorithms, not perfect scanners.
- What it is: Systematic errors (like tilted heads or stiff elbows) baked into the training labels because they came from imperfect optimization.
- How it works: (1) Use 2D keypoints/silhouettes; (2) Fit a 3D body; (3) Get stuck in local mistakes that repeat across the dataset.
- Why it matters: Models trained on biased labels learn those same mistakes and hit a performance ceiling. 🍞 Anchor: On some benchmarks, many methods tilt heads the same wrong way because the labels did.
🍞 Hook: Picture a video game world: the graphics look real but aren’t. Would a robot trained there drive perfectly on a real street?
🥬 The Concept (Synthetic Data and the Domain Gap): Synthetic data are computer-made images with perfect ground truth; the domain gap is the visual mismatch with the real world.
- What it is: Synthetic scenes differ in texture, lighting, and background, making models struggle on real photos.
- How it works: (1) Render people with exact 3D info; (2) Train a model; (3) Test on real images with different ‘looks’; (4) Performance drops.
- Why it matters: You want synthetic precision without losing real-world generalization. 🍞 Anchor: A soccer video game teaches rules (good), but your timing in a real match still feels off (gap).
🍞 Hook: You know how a friend who’s seen millions of movies can ‘guess’ missing story parts in a new clip? That prior knowledge helps fill gaps.
🥬 The Concept (Diffusion Models as Visual Priors): Diffusion models learn how real images look and can guide other tasks.
- What it is: A generative model that denoises random noise into realistic images; its knowledge can guide dense predictions.
- How it works: (1) Start with noise; (2) Step-by-step remove noise using patterns learned from huge image datasets; (3) Produce structured outputs when adapted.
- Why it matters: These models carry ‘common sense’ about real images, helping synthetic-to-real transfer. 🍞 Anchor: Marigold showed a diffusion model trained on synthetic depths can predict real-world depth well.
🍞 Hook: Imagine color-by-numbers, where each pixel number tells you exactly which spot on a 3D statue it belongs to.
🥬 The Concept (Dense Pixel-to-Surface Correspondence / Pixel-Aligned Proxies): A per-pixel label that says, “this pixel maps to that exact point on the 3D body surface.”
- What it is: Two maps per image: segmentation (which body part) and UV coordinates (the precise spot on the surface’s texture).
- How it works: (1) For each pixel, predict body part; (2) Predict its UV coordinate; (3) Use these to link image pixels to 3D mesh faces.
- Why it matters: With dense, uniform supervision, fitting becomes a simple reprojection task instead of juggling many noisy cues. 🍞 Anchor: Each pixel in a sleeve is tagged to the correct place on the shirt’s 3D fabric.
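To make the proxy idea concrete, here is a minimal Python/NumPy sketch (not the paper's code) of turning the two maps into pixel-to-surface links; `uv_to_vertex` is a hypothetical precomputed table from discretized (part, u, v) cells to points on the SMPL-X template surface.

```python
import numpy as np

def pixels_to_surface_points(part_map, uv_map, uv_to_vertex, grid=64):
    """part_map: (H, W) integer part ids, 0 = background.
       uv_map:   (H, W, 2) floats in [0, 1].
       uv_to_vertex: dict {(part, iu, iv): (3,) point on the template surface}."""
    H, W = part_map.shape
    pixels, points = [], []
    for y in range(H):
        for x in range(W):
            part = int(part_map[y, x])
            if part == 0:                    # skip background pixels
                continue
            u, v = uv_map[y, x]
            iu, iv = int(u * (grid - 1)), int(v * (grid - 1))   # discretize UV
            key = (part, iu, iv)
            if key in uv_to_vertex:          # this pixel now "points at" one surface spot
                pixels.append((x, y))
                points.append(uv_to_vertex[key])
    return np.array(pixels), np.array(points)
```

Each returned (pixel, surface point) pair is exactly the kind of dense constraint the later fitting step reprojects.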
The world before: Most HMR methods trained on real data with imperfect labels, inheriting artifacts from those labels. Multi-view methods should be powerful (seeing from many angles reduces confusion) but suffer from small, biased datasets and poor cross-dataset generalization. Synthetic datasets promise perfect supervision but face a domain gap.
The specific problem: Can we enjoy the exactness of synthetic labels and still work great on real photos without using real paired labels?
Failed attempts: (1) Heavy domain randomization—better, but not a full fix. (2) Directly regressing body parameters from images—suffers from domain gap. (3) Multi-view systems trained on limited real data—don’t generalize broadly.
The gap this paper fills: Use a diffusion model’s real-image prior to generate multi-view consistent, dense proxies learned from synthetic renders, then fit a standard body model. This sidesteps biased real labels and bridges the synthetic-to-real gap.
Real stakes: Better motion coaching, safer factory training, more accurate VR avatars, improved telepresence, physical therapy tracking, and reliable research baselines without expensive, imperfect annotations.
02 Core Idea
🍞 Hook: Imagine a group of photographers drawing the same dots on a person’s outfit from different angles so a tailor can stitch a perfect costume that fits exactly.
🥬 The Concept (One-sentence Aha!): First, use a diffusion model to paint dense, multi-view-consistent pixel-to-surface labels (the ‘dots’); then fit the SMPL-X body to those labels with extra trust given to the most certain pixels.
Multiple analogies:
- Paint-by-numbers to Sculpture: The diffusion model turns each pixel into a precise paint number (proxy). The sculptor (optimizer) then molds the 3D statue (SMPL-X) to match all those numbers from every view.
- Crowd Wisdom with Confidence: Ask several friends (stochastic samples) the same question; trust answers where they agree (low uncertainty) and be careful where they argue (high uncertainty).
- GPS Triangulation: Multiple cameras act like satellites; epipolar attention ensures all views line up to the same 3D spot, reducing confusion.
Before vs After:
- Before: Models trained on noisy real labels inherit biases; synthetic-only training struggles on real photos; multi-view consistency often weak.
- After: Dense diffusion-generated proxies provide uniform, pixel-perfect signals; epipolar attention aligns views; uncertainty weighting makes fitting robust; synthetic-only training achieves state-of-the-art on real benchmarks.
Why it works (intuition):
- Diffusion priors learned from countless real images provide general visual understanding (textures, lighting, context), helping overcome the synthetic-to-real look gap.
- Dense correspondences simplify the fitting objective: every foreground pixel directly says where it belongs on the body surface, turning many fragile cues into one strong one.
- Multi-view epipolar attention ensures all cameras tell one coherent 3D story, not separate ones.
- Uncertainty-aware test-time scaling lets the method down-weight risky regions and lean on confident ones.
Building blocks (with mini-explanations):
- 🍞 Hook: You know how a city map uses grid coordinates to find exact spots? 🥬 The Concept (UV Coordinates): UV are 2D surface coordinates that pinpoint exact locations on the 3D body’s ‘skin’ (texture).
- What: A (u,v) number pair for each surface point.
- How: The model predicts UV per pixel, linking images to the mesh.
- Why: Without UV, you only know the body part, not the precise place. 🍞 Anchor: Like “row 3, column 7” on a chessboard for a body surface.
- 🍞 Hook: If two cameras see the same point, their viewing rays meet a line of possible matches. 🥬 The Concept (Epipolar Attention): Use camera geometry so pixels in one view attend only to geometrically valid matches in other views.
- What: An attention rule constrained by epipolar lines.
- How: Embed viewing rays; let attention flow along valid correspondences across views.
- Why: Without it, views may disagree and hurt 3D consistency. 🍞 Anchor: It’s like tracing strings from two cameras to the same bead on a wire.
- 🍞 Hook: When zooming into a photo, tiny details pop out. 🥬 The Concept (Hand Refinement Module): A second pass that crops hands as extra views to improve finger detail.
- What: A coarse-to-fine two-pass refinement for hands.
- How: First get body proxies; then crop hands; run the generator again including hand views.
- Why: Without this, fingers blur or twist wrong because they’re only a few pixels in full-body views. 🍞 Anchor: Like using a magnifying glass for a detailed sketch.
- 🍞 Hook: If five friends draw slightly different maps, you average the best parts. 🥬 The Concept (Uncertainty-aware Test-time Scaling): Sample multiple proxy predictions and measure their disagreement per pixel.
- What: A way to get reliability maps that weight the fitting.
- How: Take the median UV and majority-vote segmentation; compute pixel-wise variance/disagreement as uncertainty; down-weight uncertain pixels in optimization.
- Why: Without it, outlier regions can yank the 3D fit in the wrong direction. 🍞 Anchor: Trust steady hands more when tracing a line.
- 🍞 Hook: Tailors fit patterns to fabric using pins at known spots. 🥬 The Concept (Reprojection Optimization): Fit the SMPL-X mesh so that the 3D points project back exactly to the labeled pixels.
- What: Optimize pose/shape so projections match the dense proxies.
- How: For each pixel, find its mesh face via part+UV; project; measure and reduce pixel error.
- Why: Without this step, proxies don’t become a coherent 3D mesh you can use. 🍞 Anchor: Adjusting a costume until every seam lines up with chalk marks.
Bottom line: Generate dense, consistent, and confidence-aware pixel labels with a diffusion prior, then fit a standard body model to them. That’s DiffProxy’s core.
03 Methodology
High-level overview: Input multi-view images → Diffusion proxy generator (segmentation+UV per view with epipolar attention) → Hand refinement (add magnified hand views) → Test-time scaling (multiple samples → uncertainty maps) → Reprojection optimization to fit SMPL-X → Output final 3D mesh.
Step A: Proxy generation with a diffusion backbone
- What happens: The model (built on Stable Diffusion 2.1 with a frozen UNet) predicts two images per view: (1) body-part segmentation (colors per part), and (2) UV coordinates (3-channel RGB encoding of (u,v)). It uses three conditionings: text to choose output type, a T2I-Adapter for pixel alignment, and DINOv2 tokens for pose/appearance priors. Special attention modules coordinate across modalities and views, including epipolar attention for geometric consistency.
- Why it exists: Directly predicting 3D from RGB is sensitive to domain gaps. Dense proxies break the problem into a simpler, uniform signal that’s easier to transfer from synthetic to real with diffusion priors.
- Example: Four cameras view a person raising a dumbbell. Each view’s proxy paints the biceps region consistently and assigns UV values that match the same surface spot across views.
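The epipolar attention mentioned above rests on standard two-view geometry. Below is an illustrative sketch (not the paper's implementation) that builds a boolean mask restricting attention to geometrically valid matches; intrinsics K1, K2 and the relative pose R, t are assumed known.

```python
import numpy as np

def fundamental_matrix(K1, K2, R, t):
    """R, t map view-1 camera coordinates to view-2 camera coordinates."""
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])                      # skew-symmetric [t]_x
    return np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

def epipolar_attention_mask(F, pixels1, pixels2, thresh=2.0):
    """pixels1: (N1, 2), pixels2: (N2, 2) pixel coordinates.
       Returns an (N1, N2) boolean mask: True where a view-2 pixel lies within
       `thresh` pixels of the epipolar line of the corresponding view-1 pixel."""
    h1 = np.hstack([pixels1, np.ones((len(pixels1), 1))])
    h2 = np.hstack([pixels2, np.ones((len(pixels2), 1))])
    lines = (F @ h1.T).T                                   # epipolar lines ax + by + c = 0 in view 2
    dist = np.abs(lines @ h2.T) / np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return dist < thresh
```

In an attention layer, such a mask would zero out (or heavily penalize) cross-view attention weights that violate the geometry, which is what keeps the views telling one coherent 3D story.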
🍞 Hook: Like different teachers (text, image features, geometry) helping the same student. 🥬 The Concept (Multi-conditional Mechanism): Multiple condition inputs guide the diffusion model.
- What: Combine text, low-level image alignment, and high-level features.
- How: Inject text via cross-attention, add T2I residual features, and feed DINOv2 tokens via image-attention.
- Why: Without multi-conditions, outputs drift or miss pose details. 🍞 Anchor: It’s like cooking with a recipe (text), fresh ingredients (pixels), and chef’s intuition (DINOv2).
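As a rough schematic of how the three conditioning streams could be combined inside one attention block (module choices and dimensions are illustrative assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class MultiConditionBlock(nn.Module):
    """Toy block: denoising features attend to themselves, to text tokens, and to DINOv2
    tokens, after adding pixel-aligned adapter features as a residual."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_tokens, dino_tokens, adapter_feat):
        # x, text_tokens, dino_tokens: (B, N, dim); tokens assumed pre-projected to `dim`.
        x = x + adapter_feat                                     # T2I-Adapter residual (pixel alignment)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.text_attn(x, text_tokens, text_tokens)[0]   # text picks "segmentation" vs "UV"
        x = x + self.image_attn(x, dino_tokens, dino_tokens)[0]  # pose/appearance prior
        return x
```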
Step B: Hand refinement as extra views
- What happens: First pass predicts full-body proxies and finds hand boxes. Second pass crops and magnifies hands; treat them as extra views and re-run the generator so cross-view attention sharpens fingers.
- Why it exists: Hands are tiny in full images; zooming creates signal-rich, high-resolution supervision.
- Example: A peace sign initially looks mushy; with hand crops, you see two straight fingers and a curved thumb.
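A minimal sketch of the second pass, assuming hand boxes have already been found from the first-pass proxies (helper names and the nearest-neighbor resize are illustrative, not the authors' code):

```python
import numpy as np

def add_hand_views(images, hand_boxes, crop_size=256):
    """images: list of (H, W, 3) arrays; hand_boxes: one (x0, y0, x1, y1) box per image."""
    extra_views = []
    for img, (x0, y0, x1, y1) in zip(images, hand_boxes):
        crop = img[y0:y1, x0:x1]
        # Magnify the crop to the proxy resolution (nearest-neighbor keeps the sketch
        # dependency-free; a real system would use bilinear resizing and track the
        # crop's camera intrinsics so it behaves like a genuine extra view).
        ys = np.linspace(0, crop.shape[0] - 1, crop_size).astype(int)
        xs = np.linspace(0, crop.shape[1] - 1, crop_size).astype(int)
        extra_views.append(crop[ys][:, xs])
    return list(images) + extra_views   # the generator then treats hand crops as extra views
```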
Step C: Test-time scaling and uncertainty estimation
- What happens: Run K stochastic diffusion samples per view. For UV, take pixel-wise median; for segmentation, take majority vote. Compute per-pixel uncertainty from variance/disagreement to create a weight map.
- Why it exists: Some regions (occlusions, fast motion, shiny clothes) are harder; the method should rely less on shaky pixels.
- Example: Two views argue whether a leg is left or right; uncertainty spikes there, so fitting trusts other confident views more.
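In code, the aggregation could look like the sketch below (the exact uncertainty formula is an assumption; the paper may weight things differently):

```python
import numpy as np

def aggregate_samples(uv_samples, seg_samples, num_parts):
    """uv_samples:  (K, H, W, 2) UV predictions from K diffusion samples.
       seg_samples: (K, H, W)    integer part labels from K samples."""
    uv = np.median(uv_samples, axis=0)                        # robust per-pixel UV estimate
    # Majority vote for the part label at each pixel.
    votes = np.stack([(seg_samples == p).sum(axis=0) for p in range(num_parts)], axis=-1)
    seg = votes.argmax(axis=-1)
    # Disagreement across samples becomes an uncertainty signal, then a fitting weight.
    uv_spread = uv_samples.std(axis=0).mean(axis=-1)          # (H, W)
    seg_agree = votes.max(axis=-1) / seg_samples.shape[0]     # fraction of samples agreeing
    weight = seg_agree / (1.0 + uv_spread)                    # down-weight shaky pixels
    return uv, seg, weight
```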
Step D: SMPL-X fitting via reprojection
- What happens: For each foreground pixel in each view, use its (part, UV) to locate the exact 3D surface point on SMPL-X. Project that 3D point back to the camera and minimize pixel distance to the original pixel, weighted by the uncertainty map. Optimize stage-wise (global orientation/translation/scale, then body pose/shape, then hands), using L-BFGS.
- Why it exists: This turns a messy multi-cue fitting problem into one clean objective tied directly to dense correspondences.
- Example: In a side view, the elbow pixels map to the elbow surface; as optimization proceeds, the projected elbow aligns across views and locks into place.
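A stripped-down sketch of the uncertainty-weighted reprojection objective for a single view, using PyTorch's L-BFGS; `surface_points` is a hypothetical differentiable stand-in for an SMPL-X layer that returns the 3D points selected by the dense proxies (this is not the authors' fitting code, and the full method optimizes stage-wise over all views):

```python
import torch

def fit_by_reprojection(surface_points, target_px, K, R, t, weights, iters=30):
    """target_px: (N, 2) pixel locations of the dense correspondences in this view.
       K: (3, 3) intrinsics; R: (3, 3), t: (3,) extrinsics; weights: (N,) confidences."""
    pose = torch.zeros(63, requires_grad=True)     # body pose (axis-angle), simplified
    shape = torch.zeros(10, requires_grad=True)    # shape coefficients
    opt = torch.optim.LBFGS([pose, shape], max_iter=iters, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        X = surface_points(pose, shape)            # (N, 3) points on the current mesh
        Xc = X @ R.T + t                           # world -> camera coordinates
        uvw = Xc @ K.T                             # pinhole projection (homogeneous)
        px = uvw[:, :2] / uvw[:, 2:3]
        loss = (weights * ((px - target_px) ** 2).sum(dim=-1)).mean()
        loss.backward()
        return loss

    opt.step(closure)
    return pose.detach(), shape.detach()
```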
Secret sauce: Three clever ingredients
- Diffusion prior + synthetic perfection: The diffusion backbone brings real-image common sense; synthetic training brings exact, dense labels. Together, they bridge domain gaps without real paired labels.
- Epipolar attention for multi-view coherence: Views support each other through geometric constraints so the 3D story matches from every angle.
- Uncertainty-weighted optimization: The system knows when it’s unsure and reduces the influence of those pixels, making the final 3D robust in occlusions, partial views, and tricky lighting.
Practical details (friendly):
- Training data: ~108K multi-view subjects (≈868K images) rendered with diverse motions, HDR lighting, hair, clothes, objects, and 8 randomized cameras per subject. Proxies are 256×256.
- Training: Fine-tune only adapters and attention; keep UNet and DINOv2 frozen. Also lightly fine-tune the VAE decoder for precision.
- Inference: Default 12 views total (4 body views + left/right hand crops per body view), two-pass for hand boxes, K=5 samples for uncertainty, and about 60 seconds for fitting (≈120 seconds end-to-end).
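Collected as an illustrative configuration object (field names are assumptions; the values just restate the defaults described above):

```python
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    body_views: int = 4           # full-body camera views
    hand_crops_per_view: int = 2  # left + right hand crops added as extra views
    proxy_resolution: int = 256   # proxies are 256x256
    num_samples: int = 5          # K stochastic diffusion samples for uncertainty
    two_pass_hands: bool = True   # pass 1 finds hand boxes, pass 2 refines with crops

cfg = InferenceConfig()
total_views = cfg.body_views * (1 + cfg.hand_crops_per_view)  # 4 body + 8 hand crops = 12
```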
What breaks without each step:
- No epipolar attention: Views may disagree, causing warped 3D.
- No hand refinement: Fingers blur and mislabel, reducing grasp accuracy.
- No uncertainty weighting: Occasional wrong regions pull the mesh off.
- No dense proxies: You’re back to noisy keypoints and silhouettes with hand-tuned weights.
04 Experiments & Results
The test: What did they measure and why?
- Metrics: MPJPE/PA-MPJPE (joint errors) and MPVPE/PA-MPVPE (vertex errors), where ‘PA’ (Procrustes-aligned) means errors measured after removing global rotation, translation, and scale, to focus on pose/shape quality. Lower is better.
- Why these: Joints and surface vertices capture both skeletal and full-body accuracy. PA versions discount camera/scale quirks to measure true reconstruction quality.
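For reference, the two joint metrics can be computed as follows (standard definitions, not dataset-specific evaluation code); the vertex metrics are identical with mesh vertices in place of joints:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred, gt: (J, 3) in millimeters."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (remove global rotation, translation, scale)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (P ** 2).sum()
    return mpjpe(scale * P @ R.T + mu_g, gt)
```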
The competition: Baselines include single-view giants and multi-view systems.
- SMPLest-X: A very strong single-view model trained on massive datasets (often on each target dataset).
- Human3R, U-HMR, MUC, HeatFormer, EasyMoCap: Mix of parametric regressors and optimization methods, some multi-view and some using dataset-specific training.
The scoreboard (context-rich):
- DiffProxy, trained only on synthetic data, achieves the best or near-best numbers on five real-world datasets: 3DHP (studio), BEHAVE (human–object), RICH (contacts), MoYo (challenging poses/outdoor), and 4D-DRESS (loose clothing). For example, on MoYo it reports about 36.2 mm PA-MPJPE—think of this like scoring an A when others score B’s—even though it never saw real training labels.
- On 4D-DRESS and 4D-DRESS-partial (random crops mimic partial views), DiffProxy remains top-tier. Handling missing parts is where uncertainty weighting shines.
- Hands get noticeably better with refinement: on a hands-only evaluation, PA-MPVPE improves from 17.7 mm to 16.6 mm, and MPJPE from 55.8 mm to 37.5 mm, indicating sharper finger articulation.
Surprising findings:
- Zero-shot generalization: Training on synthetic only, using diffusion priors, still beats or matches methods trained on real data, avoiding real-label biases (like consistent head-tilt artifacts).
- More views help steadily: Going from 1 to 8 views drops errors dramatically, as expected from stronger geometry; 4 views already provide a big leap.
- Test-time scaling is an easy win: Increasing K from 1 to 5 consistently reduces errors (like taking more careful measurements); going to 10 adds a tiny extra gain.
- Works without ground-truth cameras: With cameras estimated by another model, then refined during fitting, accuracy degrades modestly but remains competitive—important for real deployments.
What the numbers mean in plain terms:
- A joint error around 20–40 mm means wrists, knees, and elbows are typically within a couple of centimeters—good enough for many avatar and coaching uses.
- Surface errors in the 30–50 mm range indicate clothing drapes and limb shapes align closely, even with occlusions and tough lighting.
Takeaway: Dense, diffusion-generated proxies plus uncertainty-aware fitting produce reliable 3D humans across varied real-world scenes without real paired training data.
05 Discussion & Limitations
Limitations:
- Speed: Around 120 seconds per subject (diffusion sampling + optimization). Not yet real-time.
- View count: Single-view struggles with depth ambiguity; multi-view (e.g., 4+) is recommended for best results.
- Single subject: The current pipeline assumes one person; multi-person needs identity tracking and instance-aware conditioning.
- Resolution trade-offs: Proxies are 256×256; very small accessories or extreme finger poses may still be challenging.
- Camera calibration: Works best when cameras are known; estimation is possible but slightly reduces accuracy.
Required resources:
- A GPU setup (authors used 4× RTX 5090 for training; inference needs a good GPU too).
- Multi-view images and approximate camera parameters (or a separate camera estimator).
- Storage for synthetic training data and renderings.
When not to use:
- Real-time interactions (e.g., live AR filters on phones) where 2 minutes per subject is too slow.
- Single, very low-resolution images with heavy motion blur.
- Crowded scenes with multiple people heavily overlapping unless extended for multi-person.
Open questions:
- Can we distill or use consistency models to speed up diffusion sampling 10–50×?
- How to scale to multi-person with cross-view identity consistency and minimal extra overhead?
- Can clothing dynamics and loose garments be modeled more explicitly during fitting?
- Can we raise proxy resolution adaptively only where needed (e.g., hands/face) to balance speed and detail?
- How far can synthetic diversity push zero-shot performance without any real fine-tuning?
06 Conclusion & Future Work
3-sentence summary:
- DiffProxy trains a diffusion-based proxy generator on large-scale synthetic multi-view data to produce dense, multi-view-consistent pixel-to-surface correspondences.
- It then fits a SMPL-X mesh to those proxies with uncertainty-aware weighting, yielding accurate 3D humans—even on real photos it never trained on.
- This combination beats prior methods across five benchmarks, handling occlusions, partial views, and tricky poses.
Main achievement:
- Showing that diffusion-generated dense proxies can bridge synthetic-to-real gaps and power state-of-the-art multi-view human mesh recovery without real paired labels.
Future directions:
- Speed up with distilled/consistency models; extend to multi-person; improve clothing dynamics; push adaptive high-resolution proxies for hands/face; and reduce reliance on precise camera calibration.
Why remember this:
- DiffProxy reframes HMR: don’t learn 3D directly from pixels—first paint precise, confidence-aware, multi-view-consistent pixel labels using a diffusion prior, then fit a clean geometry model. It’s a simple, powerful recipe that travels well from synthetic worlds to the real one.
Practical Applications
- Create accurate full-body VR avatars from a few phone cameras without special suits.
- Sports form coaching with multi-view home cameras, focusing on knees, hips, and spine alignment.
- Physical therapy progress tracking by reconstructing precise joint angles over time.
- Telepresence and virtual try-on where clothes and body alignment need to look correct from all angles.
- Workplace ergonomics analysis to reduce injury risk through posture reconstruction.
- Film/game production previsualization using dense proxies to quickly fit 3D doubles to actors.
- Human–robot collaboration safety checks by monitoring 3D human pose near machines.
- Motion database creation for AI animation trained from multi-view footage without motion-capture rigs.
- Forensics or sports officiating assistance by reconstructing 3D scenes from multi-camera videos.
- Research benchmarking with bias-reduced training, enabling fairer comparisons across datasets.