CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Key Summary
- CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.
- The key trick is to rebuild the scene using a small number of flat, Lego-like pieces (planar primitives) instead of a messy, heavy mesh.
- CRISP uses where the person clearly touches things (contacts) to fill in hidden surfaces like the top of a chair seat or a stair tread.
- A simulated humanoid, trained with reinforcement learning, is used as a physics "spell-check" to make sure the reconstruction actually works in motion.
- Compared to previous methods, CRISP cuts motion-tracking failures roughly eightfold (from 55.2% to 6.9%).
- It also raises simulation throughput by about 43% (23K vs. 16K FPS), so policies train faster.
- Even if the full scene isn't perfectly complete, the important contact surfaces are cleaner, so the simulated human doesn't trip on "ghost bumps."
- The method works on standard benchmarks (EMDB, PROX) and in-the-wild videos, including Internet and even AI-generated clips.
- Physics in the loop improves both the human motion and the scene, showing that better geometry makes better learning.
- This makes it practical to learn robot skills and AR/VR interactions directly from everyday videos.
Why This Research Matters
CRISP makes it possible to learn robot and avatar skills directly from everyday videos, not just from expensive motion-capture studios. By rebuilding scenes with clean, flat supports and checking them with physics, it creates safe and realistic practice grounds for virtual humans. This speeds up training and reduces failures, which is crucial for robotics where mistakes can be costly or dangerous. For AR/VR, it means avatars that sit, step, and lean naturally in reconstructed spaces instead of sliding or clipping. It also democratizes data: anyone with a phone can capture training videos that become usable simulations. Over time, this could unlock large-scale, physically grounded learning for home robots, assistive tech, and immersive experiences.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine trying to copy a dance you saw on a video, but the floor in your room is bumpy and invisible in places. You'd slip, stumble, and give up, even if the dance moves were perfect.
🥬 The Concept (Why this research exists): Before CRISP, computers could guess 3D humans and scenes from videos, but the results were often messy and not ready for real physics. Tiny bumps in the reconstructed ground or missing parts of a chair seat could make a simulated person fall, slide, or get stuck. A good video-to-simulation system must rebuild the person and the world so that the virtual person can move safely and realistically.
How it worked before:
- AI was great at recognizing people and poses (2D/3D) and okay at making point clouds or meshes of scenes.
- But it struggled when people interacted with things (sitting, climbing, or leaning), especially under occlusions or camera motion.
- Dense meshes from methods like TSDF or neural fields could be big, noisy, and full of tiny artifacts that break physics simulations.
Why that was a problem:
- Physics engines are picky. A tiny spike in the floor can send a virtual human flying.
- Missing support (like the hidden part of a stair) makes a simulated person fall even if the original video showed a stable step.
- As a result, policies learned from such geometry failed often and trained slowly.
🍞 Anchor: Think of building a playground from cardboard. If your cardboard floor has random bumps or the bench seat is missing, any Lego minifigure you place will tip over. Clean, flat pieces in the right spots make play smooth and safe.
🍞 Hook: You know how you can tell when your foot is firmly on the ground versus almost touching it? That feeling of contact is incredibly precise.
🥬 Contact Modeling: Contact modeling is figuring out exactly which body parts are truly touching the scene and using that information to guide reconstruction.
- How it works: (1) Detect likely body-surface touches in the video; (2) Filter out "almost-touching" moments; (3) Use confirmed contacts to complete hidden supports (e.g., the top of a chair seat under a person).
- Why it matters: Without true contact understanding, the system either hallucinates wrong supports or forgets needed ones, causing falls and sliding.
🍞 Anchor: If the video shows someone sitting, contact modeling says, "There must be a seat right under the hips," and fills it in.
🍞 Hook: Picture laying down a few sturdy floor tiles instead of painting a giant, lumpy carpet.
🥬 Planar Scene Primitives: These are simple, flat 3D pieces (like thin rectangles) we use to rebuild the world.
- How it works: (1) Start from a point cloud; (2) Group points with similar surface directions; (3) Split and merge groups across frames; (4) Fit clean planes; (5) Turn planes into thin boxes for simulation.
- Why it matters: Physics engines love clean, convex shapes. They're fast to check for collisions and don't have surprises that trip up a virtual human.
🍞 Anchor: A sidewalk built from a few flat tiles is safer than one made from a thousand pebbles glued together.
🍞 Hook: Think of teaching a puppy tricks with treats.
🥬 Reinforcement Learning (RL): RL trains a virtual human controller by rewarding it for following the video's motion while obeying physics.
- How it works: (1) Show a reference pose; (2) The controller tries an action; (3) It gets rewarded for matching the pose smoothly without collisions; (4) Repeat until it learns.
- Why it matters: If the scene is wrong, the learner fails or learns bad habits; if the scene is clean, learning is fast and stable.
🍞 Anchor: The better the training room (flat floor, proper chair), the quicker the puppy learns to sit and stay.
🍞 Hook: Close one eye and still guess how far the door is: that's monocular depth estimation.
🥬 Monocular Depth Estimation: It's estimating how far things are using a single camera view.
- How it works: A neural network predicts a depth map per frame, turning pixels into distances.
- Why it matters: Without depth, you can't lift pixels into 3D, so you can't find planes or contacts.
🍞 Anchor: A depth map turns a flat photo of stairs into a 3D staircase you can step on in simulation.
🍞 Hook: When you hike, you track where you are while building a mental map, both at once.
🥬 Visual SLAM: Visual SLAM estimates the camera's path and builds a map of the scene from video.
- How it works: (1) Track features across frames; (2) Estimate camera motion; (3) Reconstruct a 3D point cloud.
- Why it matters: Without a steady world coordinate system, the person and the scene won't line up for physics.
🍞 Anchor: SLAM is the phone's "map and compass," so the virtual human stands on the right floor, not floating.
🍞 Hook: Turning a stick figure drawing into a posable doll.
🥬 Human Mesh Recovery (HMR): HMR builds a full 3D body (like SMPL) from video frames and places it in the world.
- How it works: (1) Predict body pose and shape in camera space; (2) Use the camera's pose to move it into world space; (3) Match human scale to set real sizes.
- Why it matters: Without a correctly sized, correctly placed human, contacts and physics can't be trusted.
🍞 Anchor: If the person is too small or too big, the chair height won't match, and sitting becomes impossible.
🍞 Hook: Imagine sprinkling thousands of tiny dots to "sculpt" a room.
🥬 Point Cloud Reconstruction: A point cloud is a 3D "dot" version of the world from depth and SLAM.
- How it works: Each pixel's depth turns into a 3D point; combine frames to grow a global cloud (see the sketch after this block).
- Why it matters: These dots are the raw material for fitting clean planes.
🍞 Anchor: From the dot-cloud of a staircase, we fit the flat steps the feet can stand on.
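To make this concrete, here is a minimal numpy sketch of lifting one depth map into world-space points, assuming a simple pinhole camera with intrinsics fx, fy, cx, cy and a 4×4 camera-to-world pose from SLAM. The function and variable names are illustrative, not CRISP's actual code.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Lift one depth map (H, W) into world-space 3D points.

    depth          : per-pixel metric depth
    fx, fy, cx, cy : pinhole camera intrinsics
    cam_to_world   : (4, 4) camera pose from SLAM
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx            # back-project pixels with the pinhole model
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    return (cam_to_world @ pts_cam.T).T[:, :3]   # express points in the shared world frame

# Accumulating frames grows the global point cloud:
# cloud = np.concatenate([unproject_depth(d, fx, fy, cx, cy, T) for d, T in frames])
```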
02 Core Idea
🍞 Hook: You know how a neat stack of flat Lego plates makes a sturdier build than a pile of crumbly foam? The flat pieces keep everything steady.
🥬 The "Aha!" in one sentence: If we rebuild the world from a few clean, flat pieces guided by where the person actually touches things, and then test it by making a virtual human move there, we get stable, fast, and faithful motion from ordinary videos.
Multiple analogies:
- Tiles not mush: Using planar pieces is like laying tiles on a floor instead of smoothing wet sand; you avoid hidden bumps that trip you.
- Cookie-cutter clarity: Planar fits "cut out" clean surfaces from noisy dots, like a cookie cutter makes neat shapes from dough.
- Physics spell-check: The RL humanoid is a proofreader; if the world is wrong, it "red-pencils" it by failing, pushing us to fix geometry.
Before vs. After:
- Before: Dense, lumpy meshes; missing supports; frequent penetrations; slow simulation; controllers learn slowly and fail often.
- After: Dozens (~50) of clean convex planes; contact-completed supports; stable, fast collisions; controllers learn quickly with high success.
Why it works (intuition):
- Collision simplicity: Convex, flat shapes make collision checks reliable; fewer "gotchas" than bumpy meshes.
- Contact truth: Contacts tell you where surfaces must be; using them fills occluded seats and step-tops.
- Physics in the loop: Training a controller forces the geometry to be realistically usable; errors surface early.
Building blocks (with mini "sandwich" anchors):
- 🍞 Hook: Sorting crayons by color. 🥬 K-Means Clustering: Groups points with similar surface directions (normals) to suggest plane candidates; without grouping, planes get mixed up. 🍞 Anchor: All "upward-facing" floor points land in the same bin.
- 🍞 Hook: Finding flower patches and ignoring weeds. 🥬 DBSCAN: Splits spatially separate chunks so two different walls don't merge; without it, distant planes can get glued together. 🍞 Anchor: Two walls facing the same way but on opposite sides of a room stay apart.
- 🍞 Hook: Drawing the best-fit line despite a few bad dots. 🥬 RANSAC: Fits a clean plane while ignoring outliers; without it, a few noisy points tilt the whole plane. 🍞 Anchor: A smooth tabletop fit that ignores a stray book.
- 🍞 Hook: Watching a moving stripe to see where it goes next. 🥬 Optical Flow: Tracks which plane pieces across frames are actually the same surface; without it, you'd get duplicates. 🍞 Anchor: The same stair tread seen in two frames gets merged.
- 🍞 Hook: Giving the right-sized helmet to a biker. 🥬 Scale from Human: Match scene scale using known human size; without it, meters and steps don't match reality. 🍞 Anchor: A 0.05 m seat thickness makes sense only if the human is life-sized.
🍞 Anchor: Put together, these pieces turn a noisy 3D "snowstorm" of dots into tile-like surfaces that a virtual person can safely use. That's CRISP in a nutshell.
03 Methodology
At a high level: Input video → camera and depth (Visual SLAM + depth) → global point cloud → planar primitive fitting → contact-guided completion → 3D human (HMR) aligned in world and scale → reinforcement learning controller → physics-checked motion output.
Step A: Build a stable 3D world from a single camera
- 🍞 Hook: Walking with a map and judging distances with one eye.
- 🥬 Visual SLAM + Monocular Depth: We estimate the camera's path and a depth map for each frame, then lift pixels into 3D to form a global point cloud. We use MegaSAM with MoGe depth for crisp geometry. Why this step exists: Without a stable map and distances, we can't make reliable planes or place the human. Example: A 10-second clip of stair climbing becomes a point cloud where each stair tread appears as a flat cluster of points.
- 🍞 Anchor: Now we have a shared world where both the scene and the person will live.
Step B: Turn noisy dots into clean, flat pieces (planar primitives)
- Compute normals: From the point cloud, estimate a "which-way-is-the-surface-facing" vector for each point.
- 🍞 Hook: Grouping crayons by pointing direction. 🥬 K-Means on normals: Cluster points by similar normals to find candidate planes. Why it matters: Mixing directions muddles surfaces; grouping seeds plane regions. Example: All floor-like points with upward normals cluster together.
- 🍞 Hook: Keeping far-apart piles separate. 🥬 DBSCAN spatial split: Inside each normal-cluster, split distant blobs so two different walls don't fuse. Why it matters: Separates look-alike but far-apart surfaces. Example: Left and right walls (same normal direction) become two segments.
- 🍞 Hook: Watching where the same sticker moves next frame. 🥬 Temporal merge with optical flow: Link segments across frames when they match in overlap and orientation, creating a single, consistent plane over time. Why it matters: Prevents duplicate planes that confuse physics. Example: A stair tread seen from two angles merges into one plane.
- 🍞 Hook: Best-fit line despite bad dots. 🥬 RANSAC plane fit: Fit a clean plane to each merged region, ignoring outliers. Why it matters: A few stray points shouldn't tilt the whole plane. Example: A tabletop plane stays flat even if a mug adds noisy points.
- Make simulation shapes: Turn each plane into a thin rectangular cuboid (default 0.05 m thick) oriented by the plane axes. Why it matters: Physics engines (like Isaac Gym) collide faster and more stably with a few convex boxes than with huge, bumpy meshes. Example: A room becomes ~50 boxes: floor tiles, wall panels, stair treads, table tops. (A code sketch of this grouping-and-fitting pipeline follows below.)
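Here is a compact sketch of the grouping-and-fitting idea using scikit-learn's KMeans and DBSCAN plus a hand-rolled RANSAC-style plane fit in numpy. The cluster counts, distance thresholds, and the omission of the optical-flow temporal merge are simplifying assumptions for illustration; they are not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def fit_plane_ransac(pts, iters=200, thresh=0.02, seed=0):
    """Fit a plane n.x + d = 0 to pts while ignoring outliers (RANSAC-style)."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-8:        # degenerate sample, try again
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p0
        inliers = np.abs(pts @ n + d) < thresh
        if inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, pts[best_inliers]

def extract_planar_primitives(points, normals, n_dirs=6, thickness=0.05):
    """Group points by normal direction, split spatially, fit planes, box them up."""
    primitives = []
    dir_labels = KMeans(n_clusters=n_dirs, n_init=10).fit_predict(normals)
    for k in range(n_dirs):
        group = points[dir_labels == k]
        if len(group) < 100:
            continue
        # Spatial split so two parallel but distant surfaces stay separate.
        blob_labels = DBSCAN(eps=0.15, min_samples=30).fit_predict(group)
        for blob in set(blob_labels) - {-1}:
            seg = group[blob_labels == blob]
            if len(seg) < 100:
                continue
            plane, inliers = fit_plane_ransac(seg)
            if plane is None:
                continue
            # Each plane becomes a thin box (default 0.05 m) for the simulator.
            primitives.append({"normal": plane[0], "offset": plane[1],
                               "inlier_points": inliers, "thickness": thickness})
    return primitives
```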
Step C: Use contacts to fill in hidden supports
- 🍞 Hook: If someone is sitting, you know a seat is under them, even if you can't see it.
- 🥬 Contact Modeling: We predict which body parts touch the scene (using a vision-language model) and filter out "almost-touching" frames by requiring steady contact over time with low body motion. Why it matters: Hidden supports (chair seats, stair platforms) must exist for physics to work. Example: If hips are in contact for multiple frames, we add a seat plane at that spot (a toy version of this filter is sketched below).
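A toy version of the filtering idea: a contact counts only if its predicted probability stays high over several frames while the body part barely moves, and a confirmed seated contact spawns a horizontal support plane just under the hips. The thresholds, the z-up convention, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def confirmed_contacts(contact_probs, joint_positions,
                       prob_thresh=0.8, vel_thresh=0.05, min_frames=10):
    """Keep contacts that are both confident and stable over time.

    contact_probs   : (T, J) per-frame contact probability for each joint
    joint_positions : (T, J, 3) world-space joint trajectories
    """
    vel = np.linalg.norm(np.diff(joint_positions, axis=0), axis=-1)   # (T-1, J) frame-to-frame motion
    stable = (contact_probs[1:] > prob_thresh) & (vel < vel_thresh)
    confirmed = []
    for j in range(contact_probs.shape[1]):
        frames = np.flatnonzero(stable[:, j])
        if len(frames) >= min_frames:                 # steady contact, not a graze
            confirmed.append((j, joint_positions[frames + 1, j].mean(axis=0)))
    return confirmed

def complete_seat(hip_contact_point, size=0.5, thickness=0.05):
    """Add a horizontal support plane (thin box) under a confirmed sitting contact (z-up world assumed)."""
    center = hip_contact_point - np.array([0.0, 0.0, thickness / 2])  # just below the hips
    return {"center": center,
            "half_extents": np.array([size / 2, size / 2, thickness / 2]),
            "normal": np.array([0.0, 0.0, 1.0])}
```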
Step D: Put the human into the world at the right size
- 🍞 Hook: Trying on a helmet that actually fits.
- 🥬 Human Mesh Recovery (HMR) + Scale: From GVHMR we get the 3D body in camera space and move it into world space using SLAM poses. We fix the unknown world scale by matching a normal human size. Why it matters: If the person is the wrong size, stairs and chairs won't line up. Example: Scale the world so the virtual person's pelvis height matches expected human proportions (see the sketch below).
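The sketch below shows the two operations in plain numpy: applying the SLAM camera-to-world pose to camera-space joints, and resolving the unknown monocular scale from a nominal body height. The 1.70 m reference and the joint indices are illustrative assumptions rather than CRISP's actual values.

```python
import numpy as np

HEAD_IDX, FOOT_IDX = 15, 10   # illustrative SMPL joint indices (head, left foot)

def to_world(joints_cam, cam_to_world):
    """Map camera-space joints (J, 3) into the SLAM world frame using a (4, 4) pose."""
    homo = np.concatenate([joints_cam, np.ones((len(joints_cam), 1))], axis=1)
    return (cam_to_world @ homo.T).T[:, :3]

def resolve_scale(joints_world, nominal_height=1.70):
    """Pick a global scale so the reconstructed person has a plausible height.

    joints_world : (T, J, 3) world-space joints over the whole clip.
    Monocular reconstruction is only defined up to scale; anchoring it to a
    nominal body height makes stair risers and seat heights come out in meters.
    """
    span = np.linalg.norm(joints_world[:, HEAD_IDX] - joints_world[:, FOOT_IDX], axis=-1)
    return nominal_height / np.median(span)

# The same factor rescales the scene points and planes, so the human and the
# primitives stay mutually consistent.
```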
Step E: Train a virtual human to follow the video (physics-checked)
- 🍞 Hook: Teaching a puppy to follow moves with treats.
- 🥬 Reinforcement Learning Motion Tracking: A controller sees the current body state and a short window of future target poses from the video. It outputs joint targets for a PD controller. Rewards encourage matching pose, position, and smoothness. We use PPO in Isaac Gym at 120 Hz sim and 30 Hz control. Why it matters: This proves the scene and motion are physically usable. If something's off (like a missing step), the agent fails or learns oddly. Example: On good geometry, a stair-climb clip is tracked end-to-end with stable foot contacts. (A schematic reward function follows below.)
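As a schematic of the tracking objective (in the spirit of DeepMimic-style rewards, not the paper's exact terms), the reward below combines exponential kernels on joint-position and root errors with an action-smoothness term; the weights and scales are placeholders.

```python
import numpy as np

def tracking_reward(sim_joints, ref_joints, sim_root, ref_root,
                    prev_action, action,
                    w_pose=0.6, w_root=0.3, w_smooth=0.1):
    """Reward the humanoid for matching the video-derived reference motion.

    sim_joints, ref_joints : (J, 3) simulated vs. reference joint positions
    sim_root, ref_root     : (3,) simulated vs. reference pelvis positions
    prev_action, action    : PD joint targets from consecutive control steps
    """
    pose_err = np.mean(np.sum((sim_joints - ref_joints) ** 2, axis=-1))
    root_err = np.sum((sim_root - ref_root) ** 2)
    smooth = np.sum((action - prev_action) ** 2)
    # Exponential kernels keep each term in [0, 1]; smoother, closer tracking scores higher.
    return (w_pose * np.exp(-5.0 * pose_err)
            + w_root * np.exp(-10.0 * root_err)
            + w_smooth * np.exp(-0.1 * smooth))
```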
Secret sauce (why this recipe is clever):
- Small, convex, planar pieces are both noise-robust and simulator-friendly.
- Contacts act like "X marks the spot" for occluded supports.
- RL serves as a physics-based validator that improves motion smoothness and reveals geometry mistakes early.
What breaks without each step:
- No SLAM/depth: Human and scene won't share a stable world; contacts won't line up.
- No planar fitting: Dense, noisy meshes cause collisions to explode and learning to stall.
- No contact completion: Hidden supports go missing; agents fall.
- No scale: Steps and seats are the wrong size; tracking fails.
- No RL: You don't know if the reconstruction is truly simulatable; small errors stay hidden until deployment.
04 Experiments & Results
The test (what we measured and why):
- RL success rate: Does the humanoid finish the whole motion without drifting too far from the reference?
- Simulation throughput (FPS): How fast the training runs (higher FPS means more learning per second).
- Geometry quality: Chamfer Distance (lower is better) and directional versions (Recon→GT for precision; GT→Recon for completeness), plus Non-Penetration (how often the human avoids intersecting the scene).
- Human accuracy: World-grounded joint error (W-MPJPE and WA-MPJPE), root translational error, and smoothness.
🍞 Hook: Measuring how close two shapes are, like checking puzzle piece fit. 🥬 Chamfer Distance: A score of how far points on one shape are from the other.
- How it works: For each point from A, find the nearest in B (and vice versa for two-way CD), and average distances (see the sketch after this block).
- Why it matters: Lower means the reconstruction hugs the real scene more closely; high completeness and precision. 🍞 Anchor: A low Recon→GT means your tiles sit right on the real floor and steps.
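A direct, brute-force numpy implementation of the definition above (real pipelines would use a KD-tree for large clouds):

```python
import numpy as np

def chamfer_one_way(A, B):
    """Average distance from each point in A (N, 3) to its nearest neighbor in B (M, 3)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean()

def chamfer(A, B):
    """Bidirectional Chamfer distance: Recon→GT measures precision, GT→Recon completeness."""
    return chamfer_one_way(A, B) + chamfer_one_way(B, A)
```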
🍞 Hook: Walking through a doorway without clipping the frame. 🥬 Non-Penetration: The percent of time the human doesn't intersect the scene.
- How it works: Check intersection between body and surfaces over time (a simplified check is sketched below).
- Why it matters: Intersections mean impossible physics. 🍞 Anchor: A high score means no ghost-walking through walls or floors.
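Below is a deliberately simplified version that treats the body as sample points and the scene as axis-aligned boxes; the actual metric uses the simulator's oriented collision geometry, so this sketch only makes the definition concrete.

```python
import numpy as np

def inside_box(points, center, half_extents):
    """True where a point (N, 3) lies strictly inside an axis-aligned box."""
    return np.all(np.abs(points - center) < half_extents, axis=-1)

def non_penetration_rate(body_points_per_frame, boxes):
    """Fraction of frames in which no body sample point is inside any scene box."""
    clean_frames = 0
    for pts in body_points_per_frame:
        penetrating = any(inside_box(pts, b["center"], b["half_extents"]).any() for b in boxes)
        clean_frames += 0 if penetrating else 1
    return clean_frames / len(body_points_per_frame)
```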
Competitors and settings:
- Baselines: VideoMimic (dense mesh pipeline), TSDF mesh fusion, NKSR (sharp neural surface reconstruction).
- Datasets: EMDB (global motion ground truth; outdoor/indoor) and PROX (indoor scenes with 3D scans).
- Same RL framework for everyone to keep tests fair.
Scoreboard (with context):
- RL success: CRISP's planar primitives reach 93.1% success, more than double VideoMimic's 44.8%. That's like going from a barely passing grade to an A.
- Throughput: 23K vs. 16K FPS (~43% faster), so training is quicker with our light geometry.
- Geometry: CRISP's Recon→GT Chamfer is very low (0.174–0.187), showing our surfaces sit where they matter. Bidirectional Chamfer can be slightly higher than NKSR because we skip tiny, non-contact details, but that's a smart trade for physics stability.
- Non-Penetration: Highest with planar primitives (≈0.947), meaning fewer impossible interpenetrations.
- Human accuracy (EMDB): After RL refinement, CRISP gets WA-MPJPE ≈ 70.6 mm and W-MPJPE ≈ 175.9 mm, the best among compared methods, with reduced jitter and drift.
Surprising findings:
- Cleaner but simpler beats over-detailed: Even if we don't model every tiny object, having the important flat supports cleanly modeled makes physics and learning much better.
- Physics improves vision: Putting the human into a physics loop actually reduced pose errors and jitter, suggesting that "learning by doing" helps fix visual rough edges.
- Contact guidance matters most when supports are hidden: Adding a few contact-completed planes rescued whole motions (e.g., sitting and stair stepping) that otherwise failed.
Bottom line: CRISP's scene-as-tiles design plus contact guidance turns out to be the sweet spot: fast to simulate, accurate where it counts, and much more reliable for learning controller policies.
05 Discussion & Limitations
Limitations (be specific):
- Planar-only shapes: Highly curved or organic surfaces become "faceted" approximations. This usually doesn't hurt walking/sitting but can look less complete.
- No soft or fluid stuff: Cushions that squish or water that flows aren't modeled. The method assumes rigid, static scenes.
- Static scenes only: Moving objects or people beyond the main actor aren't handled.
- Contact errors can propagate: If human pose or contact prediction drifts, a completed plane may be slightly misaligned.
- Small gaps between tiles: Visual incompleteness can appear at tile seams; physics is usually fine, but it can look a bit patchy.
Required resources:
- A decent GPU and physics simulator (e.g., Isaac Gym) for RL training.
- Visual SLAM and depth networks (MegaSAM + MoGe) are the main runtime bottlenecks.
- Short training (minutes-to-hours depending on clip count) for policy tracking.
When not to use:
- Videos with many moving scene parts (doors swinging, crowds) or deformables (pillows deforming a lot).
- Applications demanding ultra-detailed meshes for rendering rather than physics.
- Tasks needing hand-object manipulation with complex contact beyond planar supports (for now).
Open questions:
- Richer primitives: Can we mix planes with other convex shapes (e.g., superquadrics) to keep stability but capture curves?
- Dynamics: How do we extend to moving objects and interactive manipulation?
- Real-time loop: With faster SLAM/depth, can we do live video-to-sim for AR/VR telepresence?
- End-to-end physics-in-the-loop: Can the controller's failures automatically guide geometry fixes during reconstruction?
- Scene-aware policies: What's the best way to include geometry at runtime without hiding reconstruction issues during evaluation?
06 Conclusion & Future Work
Three-sentence summary: CRISP converts everyday monocular videos into simulation-ready worlds by fitting a handful of clean, flat, convex surfaces and using contact cues to fill in hidden supports. A physics-trained humanoid then tracks the recovered motion, acting as a reality check that exposes and reduces errors, yielding stable, fast, and faithful interactions. This leads to big gains in reliability (93.1% success) and speed (23K FPS), making video-to-simulation practical for robotics and AR/VR.
Main achievement: Showing that contact-guided planar primitivesācombined with physics-in-the-loop learningāare the key to turning noisy reconstructions into robust, efficient, and accurate simulation assets.
Future directions: Add richer convex primitives to capture curves, support dynamic objects and manipulation, and push toward real-time pipelines with tighter coupling between reconstruction and control. Scene-aware policies can then be layered on for deployment while keeping scene quality measurable.
Why remember this: When building worlds for physics, clean and correct beats dense and decorative. By focusing on the flat supports where humans actually touch, CRISP delivers simulations that work in practice and learn fast, unlocking large-scale skills from ordinary videos.
Practical Applications
- Train home robots to navigate stairs and sit/stand tasks by learning from phone videos of people doing them.
- Create AR/VR avatars that interact naturally with reconstructed rooms (sit on chairs, lean on tables) without clipping.
- Generate physics-ready practice environments from sports videos for coaching balance, stepping, and safe landings.
- Bootstrap humanoid locomotion datasets from Internet clips, speeding up policy learning.
- Pre-visualize stunts or choreography by turning rehearsal videos into stable simulation environments.
- Rapidly prototype game levels by extracting clean walkable surfaces and platforms from concept videos.
- Reconstruct ergonomics scenarios (e.g., factory workcells) from short recordings to evaluate posture and reach.
- Use contact-completed scenes to test assistive devices (canes, walkers) on realistic steps and ramps.
- Convert AI-generated (e.g., Sora) videos into physics-based training data for robust controller pretraining.
- Teach rehab robots or exoskeletons stable stepping and sitting behaviors from therapist demonstration videos.