
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Intermediate
Sirui Xu, Samuel Schulter, Morteza Ziyadi et al. Ā· 2/5/2026
arXiv Ā· PDF

Key Summary

  • InterPrior is a new brain for simulated humans and humanoid robots that can move, balance, and use objects by following simple goals instead of step-by-step instructions.
  • It learns in three stages: first by carefully copying expert demonstrations, then by turning those skills into a flexible goal-driven generator, and finally by practicing with rewards to handle surprises and mistakes.
  • The controller understands different kinds of goals: a single future snapshot, a short path (trajectory), or which body part should touch what and when (contact).
  • Reinforcement learning finetuning is the key that turns good copying into robust doing, especially when things go off-script.
  • Across tough tests, InterPrior succeeds more often than strong baselines, recovers from failed grasps, and handles thin or tricky objects better.
  • It generalizes to new objects and even new datasets without retraining from scratch, and supports real-time steering by a human user.
  • A shaped latent space and safe bounds help it stay natural and stable while still being creative in how it moves.
  • The same recipe also works for a real humanoid platform in simulation (Unitree G1), suggesting a path to real-world robots.

Why This Research Matters

Robots and animated characters need to move in ways that are both believable and reliable, even when the world throws curveballs. InterPrior shows how to train a single controller that understands simple goals and still handles bumps, slips, and imperfect data. This lowers the effort needed to create long, complex interactions in films, games, and VR, because fewer hand-authored fixes are required. In homes or factories, it points toward helpers that can pick up varied objects safely, retry when needed, and adapt to new items. For teleoperation, it means a human can steer at a high level while the controller manages balance, contacts, and recovery in real time. Overall, it’s a practical path to robots that not only copy what we do but also keep doing it when things go off-script.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how you think, ā€œI want that water bottle,ā€ and your body just figures out the steps—walk, reach, grasp—without you planning every tiny muscle move?

🄬 The Concept (Human-Object Interaction, HOI): HOI means people use their whole body to move around (locomotion) and handle things (manipulation) in the real, physical world. How it works:

  1. Your brain sets a simple goal (grab bottle).
  2. Your body keeps balance and makes contacts (feet on floor, hand on bottle).
  3. Physics (gravity, friction) shapes what’s possible.

Why it matters: If a robot doesn’t understand HOI, it trips, slips, or pokes through objects like a ghost.

šŸž Anchor: Picking up a box: you walk to it, bend safely, grip the sides, and lift without falling.

šŸž Hook: Imagine playing with a ragdoll in a video game. If the physics are wrong, it falls through the floor or can’t pick up anything.

🄬 The Concept (Physics-Based Simulation): A physics-based simulator is a virtual world with rules like gravity and friction, so motions are tested for real-world believability. How it works:

  1. The simulator tracks bodies, joints, and forces.
  2. The policy (the robot’s brain) outputs joint targets.
  3. A controller turns targets into torques that move the body.

Why it matters: Without physics, motions look nice but fail the instant they touch real stuff.

šŸž Anchor: If the floor is slippery in the sim, the character must step carefully or it will slide—just like on ice.
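To make step 3 concrete, here is a minimal sketch of a proportional-derivative (PD) controller that turns the policy’s joint targets into torques. The gains, joint count, and the pd_torques helper are illustrative assumptions, not the paper’s actual controller.

```python
import numpy as np

def pd_torques(q_target, q_current, qd_current, kp=60.0, kd=5.0):
    """Turn the policy's joint targets into torques with simple PD control.

    q_target:   desired joint angles from the policy (radians)
    q_current:  measured joint angles
    qd_current: measured joint velocities
    kp, kd:     illustrative gains; real controllers tune these per joint
    """
    return kp * (q_target - q_current) - kd * qd_current

# Toy usage: three joints, policy asks for a small reach forward.
q_target = np.array([0.3, -0.1, 0.0])
q_current = np.zeros(3)
qd_current = np.zeros(3)
print(pd_torques(q_target, q_current, qd_current))  # torques sent to the simulator
```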

The world before InterPrior: Many AI controllers could copy specific videos or follow a pre-made plan. But they needed dense, frame-by-frame instructions for both the human and the object. That’s like needing someone to tell you every step to tie your shoes, every single time. When the plan was a little wrong (thin objects, unexpected bumps), the robot often failed because it didn’t know how to adjust on its own.

The problem: Real life is messy. Objects are different shapes and weights, contacts can slip, and goals can change mid-task. If you train only to copy exact examples, you get brittle behavior. If you train only with rewards and no examples, you can get weird, unnatural ā€˜reward hacks’ that look nothing like human movement.

Failed attempts:

  • Adversarial priors (GAN-like): Encouraged variety but were hard to train and scale, sometimes collapsing.
  • Pure imitation: Great at copying training data but fell apart when the scene or object changed.
  • Pure reinforcement learning (RL): Could chase rewards but sometimes moved unnaturally or exploited loopholes.

The gap: We needed a controller that:

  • Understands simple, high-level goals (like ā€œtouch the handle in one secondā€),
  • Stays physically natural,
  • Handles surprises (missed grasps, pushes),
  • And generalizes to new objects and setups.

This paper’s idea: Combine the strengths of imitation (be natural) and RL (be robust) in a smart order. First learn to move well by copying, then distill those skills into a flexible, goal-following generator, and finally practice with rewards to become tough and adaptable.

Real stakes: This matters for animation (fewer hand-animated fixes), VR/AR avatars (move believably), assistive robots (pick up varied household items safely), and teleoperation (smooth, robust control). It’s the difference between a robot that freezes when it misses a grasp and one that calmly repositions and tries again.

02 Core Idea

šŸž Hook: Imagine you tell a friend, ā€œPut the book on the shelf,ā€ and they just do it—even if the shelf is a bit higher than usual or the book is slippery.

🄬 The Concept (Generative Control): Generative control is a way of creating many possible, natural motions that all accomplish the same goal, instead of following one rigid script. How it works:

  1. Learn what natural, successful motions look like.
  2. Given a goal, sample a motion ā€˜style’ from a learned skill space.
  3. Execute while staying balanced and adapting to physics.

Why it matters: Without generative control, the robot either copies exactly (brittle) or improvises badly (unnatural).

šŸž Anchor: There are many good ways to put a book on a shelf—left hand, right hand, small step, big step. Generative control picks a good one for the moment.

šŸž Hook: When you set your GPS destination, you don’t need step-by-step human directions.

🄬 The Concept (Goal-Conditioned Policy): A goal-conditioned policy is a brain that acts based on a simple goal (what you want) plus what it currently sees (where you are now). How it works:

  1. Read the current human-object state.
  2. Read a goal (like a future pose or contact).
  3. Output joint targets that move toward that goal.

Why it matters: Without goals, the robot doesn’t know what to aim for; with too-detailed goals, it can’t adapt when things change.

šŸž Anchor: ā€œTouch the mug handle in one secondā€ is enough; you don’t need to plan every elbow angle.
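A minimal sketch of the idea: current state plus a goal go in, joint targets come out. The class name, layer sizes, and feature dimensions are assumptions for illustration; the paper’s controller also uses a history encoder and a sampled latent skill.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Toy goal-conditioned policy: (state, goal) in, joint targets out."""

    def __init__(self, state_dim=128, goal_dim=64, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, action_dim),  # joint targets for the PD controller
        )

    def forward(self, state, goal):
        # Concatenate what the robot sees now with what it should achieve.
        return self.net(torch.cat([state, goal], dim=-1))

policy = GoalConditionedPolicy()
state = torch.randn(1, 128)  # current human-object state features
goal = torch.randn(1, 64)    # e.g. an encoded "hand on mug handle in 1 s"
joint_targets = policy(state, goal)
```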

šŸž Hook: Think of taking a quick snapshot of where you want to be in the near future.

🄬 The Concept (Snapshot Goals): A snapshot goal is a single future target configuration to reach. How it works:

  1. Pick a future time (e.g., one second from now).
  2. Specify only key parts (like hand and mug pose).
  3. Move to match those parts while keeping balance.

Why it matters: Without snapshots, the robot might over-plan; snapshots let it flexibly fill in the details.

šŸž Anchor: ā€œOne second from now, have your hand on the door handle.ā€ The body figures out steps and reach.

šŸž Hook: Following dots on the ground helps you walk a path.

🄬 The Concept (Trajectory Goals): Trajectory goals are short sequences of future waypoints. How it works:

  1. Give a few future mini-goals.
  2. The policy tracks them while adapting to physics.
  3. If it drifts, it can still re-align at the next waypoint.

Why it matters: Without waypoints, long tasks can drift; with them, the robot stays on course.

šŸž Anchor: ā€œStep here, then here, then grab.ā€

šŸž Hook: High-fives and handshakes are all about touching the right spot at the right time.

🄬 The Concept (Contact Goals): Contact goals say who should touch what and when. How it works:

  1. Mark desired contact regions (e.g., right hand with mug handle).
  2. Encourage approaching, aligning, and gripping.
  3. Keep balance and avoid penetration.

Why it matters: Without contact goals, grasps can miss or slide.

šŸž Anchor: ā€œPinch the thin handle with fingertipsā€ beats ā€œmove near the mug.ā€
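One way the three goal types could share a single interface is a pair of (target values, mask), where the mask marks which entries the policy must match and which it is free to fill in. The shapes, body indices, and the make_goal helper are illustrative assumptions, not the paper’s exact goal encoding.

```python
import numpy as np

def make_goal(kind, horizon=30, n_bodies=24):
    """Encode snapshot / trajectory / contact goals as (values, mask) pairs."""
    values = np.zeros((horizon, n_bodies, 3))         # target positions per body
    mask = np.zeros((horizon, n_bodies), dtype=bool)  # True = "match this entry"

    if kind == "snapshot":       # one future frame, only a few key bodies
        mask[horizon - 1, [0, 23]] = True             # e.g. pelvis + right hand
    elif kind == "trajectory":   # sparse waypoints along the way
        mask[::10, 0] = True                          # pelvis every 10 frames
    elif kind == "contact":      # who should touch what, and when
        mask[horizon - 1, 23] = True                  # right hand must make contact
    return values, mask

values, mask = make_goal("snapshot")
print(mask.sum(), "constrained entries out of", mask.size)
```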

šŸž Hook: Kids learn by watching; athletes improve by practice.

🄬 The Concept (Imitation Learning): Imitation learning teaches by copying good examples. How it works:

  1. Collect demonstrations of successful interactions.
  2. Train a policy to match them.
  3. Get natural-looking movements fast.

Why it matters: Without imitation, motions can look robotic and awkward.

šŸž Anchor: Copy how a person lifts a box safely—back straight, feet planted.

šŸž Hook: Practicing free throws under different winds makes you better on game day.

🄬 The Concept (Reinforcement Learning Finetuning): RL finetuning lets the robot practice and get rewards for success, robustness, and recovery. How it works:

  1. Start from a good imitator (so it moves naturally).
  2. Add goal-based rewards and random bumps.
  3. Learn to recover from mistakes and reach goals anyway.

Why it matters: Without RL, the policy stays brittle; without imitation, it gets weird.

šŸž Anchor: If a grasp slips, the robot tries again instead of giving up.

šŸž Hook: Choosing a paint color sets the vibe for the whole room.

🄬 The Concept (Variational Policy and Latent Space): A variational policy samples a ā€˜skill code’ (a latent) that shapes how the motion unfolds toward a goal. How it works:

  1. Encode recent history and goals.
  2. Sample a latent skill on a safe hypersphere.
  3. Decode it into joint targets.

Why it matters: Without a good latent space, motions can be dull (no variety) or unstable (too wild).

šŸž Anchor: For the same ā€˜pick up box’ goal, the latent chooses a style—left-side lift vs. right-side lift.
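A tiny sketch of steps 2 and 3: sample a latent skill with the reparameterization trick, then project it onto a fixed-radius hypersphere so no sample strays too far. The radius, dimensions, and function name are assumptions.

```python
import torch

def sample_bounded_latent(mu, logvar, radius=1.0):
    """Sample a latent skill and project it onto a hypersphere of fixed radius.

    Bounding the latent keeps rare, extreme samples from producing wild,
    unstable motions while still allowing variety in style.
    """
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                    # reparameterized sample
    z = radius * z / (z.norm(dim=-1, keepdim=True) + 1e-8)  # project onto the sphere
    return z

mu, logvar = torch.zeros(1, 64), torch.zeros(1, 64)
z = sample_bounded_latent(mu, logvar)
print(z.norm(dim=-1))  # always approximately equal to the radius
```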

Aha! Moment in one sentence: Start with natural moves (imitation), compress them into a flexible, goal-driven generator (variational policy), then harden it with practice under pressure (RL) so it stays natural and gets robust.

Before vs after:

  • Before: Needed dense references; brittle with surprises; hard to scale.
  • After: Follows simple goals; adapts to pushes and slips; composes skills over long horizons.

Why it works (intuition): Imitation sets the robot’s ā€˜motor memory’ to human-like. The variational policy keeps many valid options open. RL finetuning tightens the gaps by training on off-script states, adding recovery skills and stability—all while regularization preserves the natural look.

03 Methodology

At a high level: Observations + Goals → Stage I (Imitation Expert) → Stage II (Variational Distillation) → Stage III (RL Finetuning) → Actions

Stage I: InterMimic+ (Full-Reference Imitation Expert)

  • What happens: Train a strong teacher policy that can closely track full human-and-object reference motions while staying physically stable.
  • Why this step exists: It gives us a natural, wide-ranging skill base. Without it, the student later learns from scratch and may look awkward.
  • How (with real data):
    1. Inputs: full future references (no masking), plus current states.
    2. Outputs: joint targets turned into torques via PD control.
    3. Rewards: a big tracking term (stay close to reference), an energy/efficiency term, a termination penalty to avoid falls, and a special hand reward that encourages real contact on thin/small objects even if the reference drifts under perturbations (a sketch of this reward mix follows the list below).
  • Example: The reference says ā€œgrasp a thin rack.ā€ We add small random pushes to the pelvis or object. The expert nudges fingers and wrist to truly wrap the handle rather than blindly following an off-target path.
  • Secret sauce: Reference-free hand reward + physical perturbations expand coverage beyond the exact reference, making the expert more precise on tricky contacts.
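A minimal sketch of how the Stage I reward terms listed in step 3 might be combined. The weights and term definitions are illustrative assumptions, not the paper’s exact reward shaping.

```python
import numpy as np

def stage1_reward(tracking_err, energy, fell, hand_contact_ok,
                  w_track=1.0, w_energy=0.01, w_term=1.0, w_hand=0.5):
    """Combine the Stage I reward terms (all weights are illustrative).

    tracking_err:    distance between simulated and reference human/object state
    energy:          magnitude of applied torques (efficiency term)
    fell:            True if the episode terminated early (fall, penetration)
    hand_contact_ok: True if the hand is genuinely touching the object,
                     even when the reference has drifted (reference-free term)
    """
    r = w_track * np.exp(-tracking_err)   # stay close to the reference motion
    r -= w_energy * energy                # discourage wasteful torques
    r -= w_term * float(fell)             # heavy penalty for falling over
    r += w_hand * float(hand_contact_ok)  # reward real contact on thin objects
    return r

print(stage1_reward(tracking_err=0.2, energy=3.0, fell=False, hand_contact_ok=True))
```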

šŸž Hook: It’s like the best kid in class showing you how to solve a math problem step-by-step. 🄬 The Concept (Distillation): Distillation is teaching a smaller, goal-driven student to behave like a bigger expert, using the expert’s answers as supervision. How it works:

  1. Run the expert to get ā€˜what to do’ labels.
  2. Train a student to map sparse goals + state to expert-like actions.
  3. Slowly let the student act more on its own (DAgger) to learn from its own states too.

Why it matters: Without distillation, the student won’t inherit the expert’s rich, natural skills.

šŸž Anchor: The student learns to place a bottle even when it only sees a future target, not the whole reference movie.
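A compact sketch of a DAgger-style distillation loop. The expert, student, and environment are hypothetical stand-in callables, not the paper’s API; the point is only how the student gradually takes over while the expert keeps labeling every visited state.

```python
import random

def distill_with_dagger(expert, student_act, student_fit, env_reset, env_step,
                        n_iters=10, horizon=300):
    """Let the student act more and more on its own, but always learn from
    the expert's action at every state the rollout actually visits."""
    dataset = []
    for it in range(n_iters):
        beta = max(0.0, 1.0 - it / n_iters)  # probability of the expert driving
        state = env_reset()
        for _ in range(horizon):
            label = expert(state)             # expert's action = supervision
            action = label if random.random() < beta else student_act(state)
            dataset.append((state, label))
            state = env_step(action)
        student_fit(dataset)                  # supervised regression on the labels
    return dataset

# Toy usage with stand-in callables (the "state" is just a number here):
distill_with_dagger(
    expert=lambda s: -0.5 * s,    # "expert" pushes the state toward zero
    student_act=lambda s: 0.0,    # untrained student does nothing
    student_fit=lambda ds: None,  # fitting is a no-op in this toy
    env_reset=lambda: 1.0,
    env_step=lambda a: 1.0 + a,   # next state depends on the action
    n_iters=2, horizon=5,
)
```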

Stage II: InterPrior (Variational Distillation into a Goal-Conditioned Generator)

  • What happens: Convert the expert into a policy that takes sparse goals (snapshot, trajectory, contact) and samples a latent skill to generate natural actions.
  • Why this step exists: Real tasks rarely give full references; we need a generator that fills in the blanks plausibly.
  • How:
    1. Prior network (Transformer) reads recent history and masked goals, and outputs a Gaussian (mean and covariance) for the latent.
    2. Encoder (training only) also reads the future references to help shape the latent.
    3. We sample a latent, project it to a hypersphere (bounded), and decode actions.
    4. Loss = Action imitation (match expert), Goal reconstruction (learn to complete hidden parts), KL regularization (align posterior and prior), plus scale and temporal-consistency losses to keep latents stable over time (a simplified loss sketch follows this list).
  • Example: The input only specifies ā€˜right hand should contact the mug handle in 1 second.’ The policy samples a latent that leads to a natural reach, elbow alignment, and wrist rotation.
  • Secret sauce: Bounded latent on a hypersphere + temporal consistency prevents rare bad samples and keeps motions smooth while still diverse.
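A simplified version of the Stage II objective from step 4: action imitation, goal reconstruction, a KL term aligning the posterior with the prior, and temporal consistency of the latents. Weights, shapes, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def stage2_loss(student_action, expert_action, goal_recon, goal_true,
                mu_post, logvar_post, mu_prior, logvar_prior,
                z_t, z_prev, w_goal=1.0, w_kl=0.01, w_temp=0.1):
    """Illustrative Stage II training objective (weights are made up)."""
    action_loss = F.mse_loss(student_action, expert_action)  # match the expert
    goal_loss = F.mse_loss(goal_recon, goal_true)             # complete hidden parts

    # KL divergence between two diagonal Gaussians: posterior || prior.
    kl = 0.5 * (
        logvar_prior - logvar_post
        + (logvar_post.exp() + (mu_post - mu_prior) ** 2) / logvar_prior.exp()
        - 1.0
    ).sum(dim=-1).mean()

    temporal_loss = F.mse_loss(z_t, z_prev)  # keep latents smooth over time
    return action_loss + w_goal * goal_loss + w_kl * kl + w_temp * temporal_loss
```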

šŸž Hook: Practicing from different starting lines makes you better at reaching the finish. 🄬 The Concept (Physical Perturbations & Data Augmentation): Tiny pushes, random starts, and varied object properties widen the situations the policy sees. How it works:

  1. Randomize human/object poses and object dynamics (mass, friction).
  2. Add occasional pushes to induce drift.
  3. Penalize early termination so the policy learns to stay stable.

Why it matters: Without this variety, the policy only works in ā€˜perfect’ conditions.

šŸž Anchor: If the box is heavier than usual, the robot still figures out a stable lift.
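A toy sketch of the randomization in steps 1 and 2: vary object dynamics, perturb the starting pose, and occasionally apply a push. All ranges and the helper name are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_episode(base_mass=1.0, base_friction=0.8):
    """Draw randomized conditions for one training episode (illustrative ranges)."""
    return {
        "object_mass": base_mass * rng.uniform(0.5, 2.0),   # lighter or heavier
        "friction": base_friction * rng.uniform(0.7, 1.3),  # slicker or grippier
        "start_pose_noise": rng.normal(0.0, 0.05, size=3),  # offset start (meters)
        "push_force": rng.normal(0.0, 30.0, size=3)         # occasional shove
                      if rng.random() < 0.1 else np.zeros(3),
    }

print(randomize_episode())
```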

Stage III: RL Post-Training (Finetuning Beyond Reference)

  • What happens: Practice goal-reaching from many unusual or ā€˜in-between’ states, with sparse success rewards. Keep some environments running the distillation loss to preserve naturalness.
  • Why this step exists: Distillation alone is brittle when states drift off the data manifold. RL adds robustness and recovery.
  • How:
    1. In-betweening setup: Start from a random state; pick a single snapshot goal; reward success when close enough; also include energy/hand/contact/termination terms.
    2. Failure resets: Intentionally start from near-failure (missed grasp, small slip) so the policy learns to recover.
    3. Prior preservation: Some environments still optimize the distillation ELBO; others do PPO RL. Shared parameters get both signals (a sketch of this mixed update follows the list).
    4. Optional new skill tokens (e.g., ā€œget upā€) with simple shaping rewards add missing behaviors.
  • Example: The robot misses a grasp on a thin rod. Instead of freezing, it re-approaches and re-grasps, because it has practiced exactly this.
  • Secret sauce: Multi-objective training (RL + distillation) keeps motions human-like while getting much tougher against drift.
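A rough sketch of the multi-objective update from step 3: some environments contribute PPO losses, others contribute distillation losses, and both back-propagate into the same shared policy parameters. The function signature and loss callables are hypothetical stand-ins, not the paper’s training code.

```python
import torch

def stage3_update(policy, optimizer, rl_batches, distill_batches,
                  ppo_loss_fn, distill_loss_fn, w_rl=1.0, w_distill=1.0):
    """One mixed Stage III update over PPO and distillation environments."""
    total = torch.zeros(())
    for batch in rl_batches:       # robustness, recovery, sparse success
        total = total + w_rl * ppo_loss_fn(policy, batch)
    for batch in distill_batches:  # stay close to the expert / prior
        total = total + w_distill * distill_loss_fn(policy, batch)

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total.detach())
```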

šŸž Hook: If you slip on the first step of the stairs, you don’t stop; you steady yourself and keep going. 🄬 The Concept (Failure Recovery): Recovery means noticing when things go wrong and correcting fast. How it works:

  1. Detect near-failure (lost contact, bad alignment).
  2. Adjust the goal prompt (e.g., re-approach) without changing the model.
  3. Use learned skills to get back on track.

Why it matters: Without recovery, long tasks crumble after one mistake.

šŸž Anchor: A slipped mug grip becomes a quick re-grasp instead of a dropped cup.
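A toy heuristic for step 2: if a grasp was expected but the hand has drifted away from the object, issue a fresh re-approach contact goal instead of chasing the original one. The thresholds, goal fields, and helper name are invented for illustration; in InterPrior most recovery behavior is learned rather than hand-coded.

```python
def maybe_reprompt(goal, hand_object_dist, contact_expected,
                   lost_contact_thresh=0.08, standoff=0.15):
    """Swap in a 're-approach and regrasp' goal when contact appears lost."""
    if contact_expected and hand_object_dist > lost_contact_thresh:
        return {
            "type": "contact",
            "body": "right_hand",
            "target": "object_handle",
            "time_offset": 1.0,             # aim for contact one second from now
            "approach_standoff": standoff,  # back off slightly before regrasping
        }
    return goal  # things look fine, keep the original goal

print(maybe_reprompt({"type": "snapshot"}, hand_object_dist=0.2, contact_expected=True))
```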

04 Experiments & Results

The test: Can a single controller follow different goal types (snapshot, trajectory, contact), chain them over long horizons, and survive tough setups like random initializations or thin-object interactions? We measure:

  • Success Rate (SR): finished without failure.
  • Human/Object Errors: how close to the goal state.
  • Failure Rate: early terminations like falls.
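For concreteness, a small helper that could aggregate these metrics over a batch of rollouts. The result-dict fields and function name are assumptions, not the paper’s evaluation code.

```python
import numpy as np

def summarize_rollouts(results):
    """Aggregate per-episode results into SR, errors, and failure rate."""
    n = len(results)
    return {
        "success_rate_%": 100.0 * sum(r["success"] for r in results) / n,
        "human_error": float(np.mean([r["human_err"] for r in results])),
        "object_error": float(np.mean([r["object_err"] for r in results])),
        "failure_rate_%": 100.0 * sum(r["fell"] for r in results) / n,
    }

print(summarize_rollouts([
    {"success": True, "human_err": 8.1, "object_err": 6.5, "fell": False},
    {"success": False, "human_err": 15.2, "object_err": 12.0, "fell": True},
]))
```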

The competition: Strong baselines that already scale across objects and tasks:

  • InterMimic (full-reference imitator—excellent tracking, weaker robustness).
  • MaskedMimic (goal-conditioned distillation—versatile but brittle under big drift).

The scoreboard (highlights):

  • Full-reference imitation under thin objects and noisy starts:
    • InterMimic: SR 63.9%; a very strict tracker, but it often misses true contact under perturbations.
    • InterPrior tracking: SR 83.2%. It allows smart micro-deviations to fix contact; human error may be slightly higher (8.9 vs 7.1) because it prioritizes successful interaction over pixel-perfect tracking.
  • Goal-conditioned tasks (InterPrior full with RL finetuning):
    • Snapshot goals: 90.0% success; Eh ā‰ˆ 13.6; Eo ā‰ˆ 9.5; Fail ā‰ˆ 3.7%.
    • Trajectory goals: 94.6% success; Eh ā‰ˆ 7.9; Eo ā‰ˆ 6.9; Fail ā‰ˆ 2.5%.
    • Contact goals: 90.7% success; Ec ā‰ˆ 15.9; Eo ā‰ˆ 9.9; Fail ā‰ˆ 2.9%.
    • Multi-goal chaining (long horizon): 68.8% success. This is like keeping a B while others drop to a D when the teacher keeps changing the question mid-test.
    • Random-initialization object lifting: 88.6% success; Eo ā‰ˆ 11.9.
  • Compared to MaskedMimic under the same goals, InterPrior’s gains are biggest when tasks are sparse, long, and perturbed—exactly the hard cases we care about.

Surprising findings:

  • RL finetuning focused only on snapshot goals does not hurt trajectory following; it even helps. Why? Because the training keeps a distillation loss active under trajectory inputs, and when trajectory following starts to drift, the system reframes a near waypoint as a snapshot—so snapshot mastery supports better trajectory tracking.
  • Bounding the latent on a hypersphere and adding temporal consistency gives a notable stability boost in long, contact-rich rollouts. It’s a light-touch trick with big impact.
  • InterPrior acts like a reusable prior: even without retraining-from-scratch, it adapts faster and better to new datasets (BEHAVE, HODome), tolerating imperfect labels and misaligned contacts that break pure trackers.

Qualitative takeaways:

  • Thin-object grasps: InterPrior tweaks hand pose to wrap handles rather than blindly copying the path; grasps ā€˜stick’ more reliably.
  • Long-horizon tasks: It transitions smoothly across approach → grasp → lift → reposition, and when drift begins, it self-corrects instead of spiraling into failure.
  • New objects and sim-to-sim (IsaacGym → MuJoCo): Maintains coherent, human-like motion under sparse object-conditioned goals—encouraging for real-world transfer.

05 Discussion & Limitations

Limitations:

  • Coverage-bound: If the training data never shows certain affordances (e.g., very long, ultra-thin objects), the policy can prefer safe balance over perfect completion.
  • Hand dexterity: The system models whole-body contacts well but not fine in-hand finger rolling or spinning.
  • Occasional artifacts: Small penetrations and foot skating can appear over long runs.
  • Complexity: Three training stages and multiple hyperparameters add engineering overhead.

Required resources:

  • A GPU physics simulator (e.g., IsaacGym) to run many environments in parallel.
  • Large-scale HOI demonstrations and storage for rollouts.
  • Multiple GPUs help when mixing distillation and RL at scale.

When NOT to use:

  • Tasks dominated by soft-body deformations (e.g., cloth manipulation with straps) not represented in training.
  • Precision micro-manipulation that needs fingertip-level dexterity and tactile feedback.
  • Safety-critical deployments without additional sensors, failsafes, and verification.

Open questions:

  • Perception: How to move from goal keypoints to goals from raw vision in the wild, robustly?
  • Language: Can natural-language instructions become reliable goal prompts for long chores?
  • Dexterity: How to integrate richer hand models and tactile cues without losing stability?
  • Sim-to-real: What minimal extra training or sensing is needed for confident real-world transfer?
  • Planning + control: How to combine this goal-driven controller with high-level task planning over many objects and rooms?

06 Conclusion & Future Work

3-sentence summary: InterPrior is a three-stage recipe—imitation → variational distillation → RL finetuning—that turns natural motion copying into a robust, goal-following, physics-grounded controller. It handles snapshot, trajectory, and contact goals; composes skills over long horizons; and recovers from slips and missed grasps while staying human-like. The same prior adapts to new objects and datasets and even runs on a humanoid embodiment in simulation.

Main achievement: Showing that RL finetuning, anchored to a distilled, goal-conditioned latent prior, is the missing piece that scales from ā€˜reconstructing demonstrations’ to ā€˜reliably accomplishing goals under perturbations.’

Future directions: Add perception and language goal-setting, expand hand dexterity and soft-object handling, and push sim-to-real transfer with safety layers. Explore tighter integration with planners and diffusion-based lookahead for even longer, more complex tasks.

Why remember this: It’s a practical blueprint for humanoid loco-manipulation—learn human-like moves, compress them into a flexible generator, then harden them with practice—so robots can not only copy what we do, but also keep doing it when the world misbehaves.

Practical Applications

  • Animate long, realistic human-object scenes (approach, grasp, lift, place) with minimal hand-editing.
  • Drive VR/AR avatars that respond to sparse user goals while maintaining balance and believable contacts.
  • Enable teleoperation where a human sets high-level intents (e.g., ā€˜pick up box’), and the controller handles details and recovery.
  • Prototype household assistive robots that can regrasp after slips and adapt to varied object shapes and masses.
  • Benchmark new HOI datasets by plugging them in as additional goals and measuring zero-shot generalization.
  • Teach new skills (e.g., get-up) by adding simple tokens and rewards, without retraining everything from scratch.
  • Integrate with kinematic planners or diffusion models to turn rough plans into physically consistent motions.
  • Stress-test robot designs in simulation with randomized dynamics to evaluate robustness before hardware trials.
  • Run interactive demos where a user steers snapshot/trajectory/contact goals via a keyboard in real time.
#human-object interaction Ā· #physics-based control Ā· #goal-conditioned policy Ā· #variational policy Ā· #latent skill space Ā· #imitation learning Ā· #reinforcement learning finetuning Ā· #distillation Ā· #loco-manipulation Ā· #failure recovery Ā· #contact-rich control Ā· #simulation to real Ā· #humanoid robots Ā· #IsaacGym Ā· #whole-body coordination