SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Intermediate
Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg et al. · 2/18/2026
arXiv

Key Summary

  • SimToolReal teaches a robot hand to use many different tools by practicing in simulation and then working in the real world without extra training.
  • The robot learns one simple goal: move whatever tool it holds to the next desired pose, again and again, like following dots on a path.
  • Instead of hand-making rules for each tool, the team generates lots of simple “tool-like” shapes (handles and heads) and trains one policy on all of them.
  • A vision pipeline finds the real tool’s 6D pose (where it is and how it’s turned) and its graspable handle so the policy can work without seeing raw pixels.
  • Goal poses come from a single human RGB-D video; the robot tracks that object trajectory step by step (zero-shot).
  • Across 120 real tests (24 tasks, 12 tools, 6 categories), SimToolReal generalized well, beating retargeting and fixed-grasp baselines by 37% in task progress.
  • It matched task-specific specialist RL policies even though those specialists were trained on the exact object and motion.
  • Key training ingredients included SAPG for exploration, domain randomization for robustness, and an asymmetric critic for stable learning.
  • The method shines at grasping, in-hand reorientation, and smooth tool motions; hardest cases involve thin, heavy, or visually ambiguous tools.
  • This work suggests a general, object-centric way to learn dexterous tool use without per-task engineering.

Why This Research Matters

General dexterous tool use is a gateway to truly helpful robots in homes, workshops, and factories. Instead of retraining or reprogramming for each new object, one object-centric policy can pick up a range of tools and follow a human’s demonstrated motion path. That slashes engineering time and makes deployment faster and cheaper. It also paves the way for safer collaboration, since closed-loop tracking adapts on the fly when small slips or occlusions happen. By framing tool use as goal pose tracking, we can plug in new tasks simply by providing a single human RGB-D video, lowering the barrier for real users to teach robots. Over time, this approach can expand to more objects, more environments, and richer feedback (like touch), making robots far more capable in the messy real world.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how learning to use a new kitchen tool, like a spatula or whisk, gets easier once you understand how to hold it and where to move it? You don’t relearn cooking from scratch for each tool—you reuse the same basic skills.

🥬 The Concept (Why this research exists): Robots have a hard time with tools. They need to pick thin objects off a table, twist them in-hand to a useful angle, and move them while pushing against the world (like brushing or hammering). Before this work, most robot systems learned each tool and task separately. That meant lots of custom simulations, careful reward tuning, and fragile code. The paper’s big idea is to flip the problem: teach one general skill—move the object to the next desired pose—so the same policy works for many tools and tasks.

🍞 Anchor: Imagine a robot asked to sweep crumbs with any brush it finds. Instead of coding special rules for each brush, we train it to grab and rotate objects and then aim them through a series of target poses that trace a sweeping motion.

— New Concept 1 — 🍞 Hook: Think of a video game practice mode where you try moves safely before playing for real. 🥬 Sim-to-Real Transfer: What it is: Teach a robot in simulation, then make it work in real life. How it works: (1) Build a fast simulator; (2) Train the robot with lots of randomized situations; (3) Carefully match delays, noise, and physics quirks; (4) Then run the same policy on the real robot. Why it matters: Without sim-to-real, the robot would freeze or fail when reality doesn’t match the simulator. 🍞 Anchor: A robot learns to pick up and twist a fake “hammer” in sim, then uses those moves on a real hammer without new training.

— New Concept 2 — 🍞 Hook: You know how you learn by trying, getting a thumbs-up for good choices? 🥬 Reinforcement Learning (RL): What it is: A robot learns by trial and error, getting rewards for good outcomes. How it works: (1) Try an action; (2) See what happens; (3) Get a reward based on progress; (4) Adjust the policy to get more reward next time. Why it matters: Without RL, we’d have to hard-code every finger move, which isn’t scalable for dexterous hands. 🍞 Anchor: The robot gets points for moving a tool closer to a target pose and big points when it reaches it.

— New Concept 3 — 🍞 Hook: Picture a treasure map with X marks to hit in order. 🥬 Goal-Conditioned Policy: What it is: A controller that takes the current goal (the next X) as input and chooses actions to reach it. How it works: (1) Read the current tool pose; (2) Read the goal pose; (3) Output joint targets to move toward the goal; (4) When reached, switch to the next goal. Why it matters: Without goals, the robot wouldn’t know which way to move or when to switch steps. 🍞 Anchor: To write a C with a marker, the robot follows a series of marker poses along the letter’s curve.
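The read-pose, read-goal, act, advance cycle above can be sketched as a toy point-mass version. This is a minimal illustration under stated assumptions: the function names, the 2 cm tolerance, and the simple proportional controller in the usage example are ours, not the paper's implementation (the real policy outputs joint targets at 30 Hz).

```python
import numpy as np

def track_goals(get_pose, policy, goals, tol=0.02, max_steps=300):
    """Follow a sequence of goal poses, advancing when within tolerance.

    get_pose() -> current tool position (3,); policy(pose, goal) steps the
    system toward the goal. Returns the fraction of goals reached.
    Names and the position-only state are illustrative simplifications.
    """
    reached = 0
    for goal in goals:
        for _ in range(max_steps):
            pose = get_pose()
            if np.linalg.norm(pose - goal) < tol:  # close enough: next goal
                reached += 1
                break
            policy(pose, goal)                     # act toward current goal
    return reached / len(goals)

# Usage with a toy controller that moves 10% of the way each step:
state = np.array([0.0, 0.0, 0.0])
def get_pose():
    return state.copy()
def policy(pose, goal):
    state[:] = state + 0.1 * (goal - pose)

goals = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])]
progress = track_goals(get_pose, policy, goals)
```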

— New Concept 4 — 🍞 Hook: A chef doesn’t memorize one perfect spoon—they understand “spoon-ness” (a handle plus a bowl) and how to use it. 🥬 Object-Centric Policy: What it is: A policy that focuses on the object’s state (pose, grasp region) rather than pixel images. How it works: (1) Sense the tool’s 6D pose (position + orientation); (2) Provide a coarse bounding box for the graspable handle; (3) Control fingers and arm using this simple object description; (4) Use memory (LSTM) to infer hidden properties like weight. Why it matters: Without object-centric inputs, vision gaps between sim and real can confuse the policy. 🍞 Anchor: For a brush, the policy just needs “where is it?” and “where’s the handle?” to grasp, rotate, and sweep.

The World Before: Parallel-jaw grippers struggled with torques during tool use. Teleoperation demos for dexterous hands were hard due to human-robot mismatch and missing force feedback. Sim-to-real RL existed but usually needed handcrafted object models and custom rewards per task.

The Problem: Create one general dexterous policy that works on many unseen real tools and tasks—without per-tool modeling or per-task rewards.

Failed Attempts: (1) Kinematic motion retargeting from human video often fails to grasp or maintain contact (no force reasoning). (2) Fixed-grasp methods can’t rotate tools in-hand, causing arm collisions or limited reach. (3) Specialist RL policies overfit to one object/trajectory and don’t generalize.

The Gap: A universal objective and training recipe that induces the core skills—grasping, in-hand reorientation, and stable motion—so you can plug in any new tool and a goal pose path at test time.

Real Stakes: Reliable tool use means more helpful robots at home (wiping, writing), in workshops (hammering, brushing), and in factories (screwdriving), all without reprogramming for each new item.

02Core Idea

🍞 Hook: Imagine teaching a friend to use any new gadget by saying, “Just move it through these checkpoints,” instead of explaining every little finger motion.

🥬 The Aha! Moment: If a robot can move any hand-held object to any sequence of poses, it can perform many tool-use tasks without learning each task separately.

Multiple Analogies:

  1. GPS Waypoints: The tool just needs to hit the next waypoint (pose) and then the next—like driving from pin to pin on a map.
  2. Connect-the-Dots: Drawing a picture by moving your pencil through numbered dots mirrors moving a tool through goal poses.
  3. Dance Choreography: Learn basic moves (grasp, rotate, sweep), then follow a new routine (goal sequence) for any song (tool/task).

Before vs After:

  • Before: Per-task engineering, fragile reward shaping, object-specific modeling, and limited generalization.
  • After: One policy trained on many simple tool-like shapes and random goal poses, then zero-shot follows human-demonstrated trajectories on real tools.

Why It Works (intuition):

  • Practicing random goals across many shapes forces the robot to master the fundamentals: lifting from flat surfaces, in-hand reorientation, and maintaining stable contact.
  • Using only object pose + grasp box avoids the sim-to-real visual gap and keeps the task focused.
  • Memory (LSTM) helps the policy adapt to unobserved physics like weight distribution.
  • Training with exploration-friendly RL (SAPG) and an informed critic stabilizes learning of these subtle hand-arm skills.

Building Blocks (Sandwich for each key piece): — New Concept 5 — 🍞 Hook: Think of 3D coordinates like an address: where something lives and how it’s turned. 🥬 6D Pose: What it is: The tool’s position (x, y, z) plus its orientation (roll, pitch, yaw). How it works: Cameras and pose estimators track where the tool is and how it’s rotated in space. Why it matters: Without the exact pose, the robot can’t aim or align the tool correctly. 🍞 Anchor: To hammer a nail, the head must be turned just right; that’s orientation.

— New Concept 6 — 🍞 Hook: When you grab a broom, you hold the handle, not the bristles. 🥬 Grasp Bounding Box: What it is: A simple 3D box around the graspable part (usually the handle). How it works: Segment the tool, find the handle region, define a box in the object’s frame. Why it matters: Without a handle hint, the robot might try to pinch the wrong spot. 🍞 Anchor: The robot sees “this box is safe to grab,” so it closes fingers there.

— New Concept 7 — 🍞 Hook: LEGO kits mix a few bricks to make many toys. 🥬 Procedural Generation: What it is: Automatically creating many tool-like shapes from simple parts (handle + head). How it works: Randomize sizes, shapes (box/cylinder), and densities to create brushes, markers, spatulas, hammers, screwdrivers. Why it matters: Without variety in training, the robot would overfit to one tool and fail on others. 🍞 Anchor: Some heads are heavy like a hammer, some are light like a brush; the policy must handle both.
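Handle-plus-head procedural generation can be sketched as random sampling over two primitives. All ranges, field names, and units below are illustrative guesses, not the paper's actual parameters.

```python
import random

def sample_tool(rng=random):
    """Sample one random tool-like primitive: a handle plus a head.

    Shapes, size ranges, and densities are illustrative assumptions;
    varying density makes some tools brush-light and others mallet-heavy.
    """
    handle = {
        "shape": rng.choice(["box", "cylinder"]),
        "length_m": rng.uniform(0.10, 0.30),   # short marker .. long brush
        "radius_m": rng.uniform(0.008, 0.020),
    }
    head = {
        "shape": rng.choice(["box", "cylinder"]),
        "size_m": rng.uniform(0.02, 0.08),
        "density_kg_m3": rng.uniform(200.0, 8000.0),  # foam .. steel-ish
    }
    return {"handle": handle, "head": head}
```

Training on thousands of such samples is what forces the policy to learn tool-general skills rather than memorizing one object.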

— New Concept 8 — 🍞 Hook: Practicing in a gym with lights flickering and a noisy crowd makes you ready for game day. 🥬 Domain Randomization: What it is: Add noise, delays, and random forces during training. How it works: Jitter the sensed pose, delay actions, push the object a bit, vary table height. Why it matters: Without it, the robot would be brittle and break down in real-world messiness. 🍞 Anchor: If the camera loses a few frames, a robust policy still tracks the brush path.
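Two of the randomizations named here, pose jitter and action delay, can be sketched as a thin wrapper around the simulator step. Class and parameter names are ours; the paper's randomization set (forces, table height, and more) is broader than this.

```python
import collections
import numpy as np

class DomainRandomizer:
    """Sketch of pose noise and action latency for sim-to-real training."""

    def __init__(self, pose_noise=0.005, delay_steps=2, seed=0):
        self.pose_noise = pose_noise
        self.rng = np.random.default_rng(seed)
        # queue so actions take effect a few steps late, like real latency
        self.action_queue = collections.deque(
            [np.zeros(3)] * delay_steps, maxlen=delay_steps + 1)

    def noisy_pose(self, true_pose):
        """Return the true pose with Gaussian jitter, like a real tracker."""
        return true_pose + self.rng.normal(0.0, self.pose_noise, size=3)

    def delayed_action(self, action):
        """Enqueue the new action, execute the one from delay_steps ago."""
        self.action_queue.append(action)
        return self.action_queue.popleft()
```

A policy trained against this wrapper cannot rely on instant, perfect feedback, which is exactly the robustness the real robot needs.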

— New Concept 9 — 🍞 Hook: A good coach sees more than the player and gives better advice. 🥬 Asymmetric Critic: What it is: The value estimator (critic) gets extra clean info in sim to guide learning, while the actor uses only real-world-available inputs. How it works: Critic sees perfect states; actor sees noisy, delayed observations; both learn together. Why it matters: Without the critic’s extra insight, learning dexterous skills is too unstable. 🍞 Anchor: The coach (critic) speeds up training so the player (actor) performs with noisy stadium conditions.

— New Concept 10 — 🍞 Hook: Split up into teams to explore different trails, then meet to share the map. 🥬 SAPG (Split and Aggregate Policy Gradients): What it is: An RL training method that runs multiple explorer policies in parallel and aggregates their experience. How it works: Split environments among policies, explore more widely, then update a leader using everyone’s data. Why it matters: Without broad exploration, the robot gets stuck and never discovers tricky in-hand rotations. 🍞 Anchor: Different explorers find different grasps; combining them teaches a stronger general policy.

— New Concept 11 — 🍞 Hook: Remembering what just happened helps you decide what to do next. 🥬 LSTM Memory: What it is: A network block that keeps a short-term memory of interactions. How it works: It summarizes recent history to infer hidden properties (like weight or friction). Why it matters: Without memory, the robot can’t adapt its grip when the tool feels heavier than expected. 🍞 Anchor: After lifting a mallet, the policy “remembers” it was heavy and tightens the grasp during swings.

— New Concept 12 — 🍞 Hook: Ask an expert to outline the path you should follow. 🥬 SAM 3D + FoundationPose: What it is: Vision tools that (1) reconstruct a metric 3D mesh and grasp region (SAM 3D) and (2) track 6D object pose over time (FoundationPose). How it works: Segment the tool, build a mesh with true scale, then track its pose through the video to create the goal pose sequence. Why it matters: Without reliable object pose and handle cues, the policy couldn’t run zero-shot in the real world. 🍞 Anchor: From one human video of sweeping, we get a brush mesh, the handle box, and a smooth pose path to follow.

— New Concept 13 — 🍞 Hook: A fair test means the same obstacle course for everyone. 🥬 DexToolBench: What it is: A benchmark of 24 daily tool-use tasks across 12 tools and 6 categories (hammer, marker, eraser, brush, spatula, screwdriver), with matching simulation. How it works: Each task provides a human RGB-D video; the robot must track the object’s pose trajectory. Why it matters: Without a shared testbed, we can’t tell if methods truly generalize. 🍞 Anchor: The robot writes letters, wipes boards, sweeps, flips, hammers, and spins screwdrivers across many instances.

Bottom Line: The core idea is simple but powerful—train one object-centric, goal-conditioned policy on a world of varied, pose-reaching practice, then run it zero-shot on real tool trajectories extracted from a single video.

03Methodology

At a high level: Human RGB-D video → (SAM 3D + FoundationPose) → Sequence of tool goal poses + handle grasp box → Policy input (robot proprioception + 6D pose + grasp box + current goal) → Joint position targets for the arm (7 DoF) and hand (22 DoF).

Step-by-step (with Sandwich explanations for new pieces):

  1. Build a training world with lots of randomized tool primitives.
  • 🍞 Hook: Like a toy factory spitting out endless broomsticks and mallets.
  • 🥬 Procedural Tool Primitives: What: Random handle + head shapes (boxes or cylinders), sizes, and densities. How: Sample lengths, widths, diameters; attach head at the handle end; vary mass so some are head-heavy like hammers. Why: Without this variety, the policy would overfit and fail on new tools.
  • 🍞 Anchor: Some tools end up thin and light (markers), others thick and heavy (mallets).
  2. Define a universal objective: reach random goal poses.
  • 🍞 Hook: Connect-the-dots for any tool.
  • 🥬 Goal-Conditioned RL: What: Train a policy to move the object to the next goal pose; when close enough, switch to the next goal. How: Reward progress toward goals, give a success bonus when reached, then resample a new nearby goal to create smooth trajectories. Why: Without goal-driven rewards, the policy doesn’t learn structured motions.
  • 🍞 Anchor: The policy first learns to lift off the table, then to reorient and follow mini-trajectories.
  3. Observe just what we can reliably get in the real world.
  • 🍞 Hook: Don’t ask for what you can’t measure on game day.
  • 🥬 Object-Centric Inputs: What: Use only the current 6D pose and a coarse handle box, not raw images; add robot proprioception. How: The policy reads (pose, handle box, goal) and outputs joint targets; an LSTM adds memory of recent interactions. Why: Without restricting inputs, you inherit sim-to-real vision problems.
  • 🍞 Anchor: The same policy can run on any camera setup that provides tool pose and a handle region.
  4. Stabilize and scale training.
  • 🍞 Hook: Train like a team: explore widely, learn from a smart coach, and practice under noisy lights.
  • 🥬 SAPG: What: Split environments among multiple policies to explore, and aggregate their experience to guide a leader policy. Why: Prevents exploration from getting stuck.
  • 🥬 Asymmetric Critic: What: Critic sees clean sim states; actor sees noisy, delayed observations. Why: Provides strong learning signals under partial observability.
  • 🥬 Domain Randomization: What: Add delays, pose noise, force bumps, and table-height changes. Why: Builds real-world robustness.
  • 🍞 Anchor: With these three together, the robot learns reliable in-hand rotations and strong grasps.
  5. Convert one human RGB-D video into a goal path and handle box.
  • 🍞 Hook: Ask a camera-savvy friend to mark the path for you.
  • 🥬 SAM 3D + FoundationPose: What: (1) Reconstruct a metric 3D mesh from the video’s depth; (2) Segment the handle to define a grasp box; (3) Track 6D pose across the video to get the goal sequence; (4) Downsample to smooth it and cut the pre-grasp still phase. Why: Without an accurate mesh and pose, the path is shaky and hard to follow.
  • 🍞 Anchor: A wiping video turns into clean eraser poses spaced along the wipe path.
  6. Run the policy closed-loop on the real robot.
  • 🍞 Hook: Keep checking the map and correct course as you go.
  • 🥬 Closed-Loop Tracking: What: At 30 Hz, update the current object pose; feed the next goal; move until within tolerance; then advance. How: The LSTM policy outputs joint position targets for both arm and hand; actions are smoothed for safety. Why: Without closed-loop control, small slips or noise would accumulate and cause failure.
  • 🍞 Anchor: If the screwdriver drifts, the policy recenters it before spinning again.
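The progress-plus-bonus reward described in step 2 can be sketched as follows. This is a minimal, hedged illustration: the function name and constants are ours, and it credits only new progress, so hovering near the goal earns nothing.

```python
def progress_reward(dist, best_dist, success_bonus=1.0, tol=0.02):
    """Reward only new progress toward the goal, plus a success bonus.

    dist: current distance to the goal pose; best_dist: best (smallest)
    distance seen so far this goal. Constants are illustrative.
    """
    reward = max(0.0, best_dist - dist)   # pay only for closing the gap
    new_best = min(best_dist, dist)       # hovering earns nothing next step
    if dist < tol:
        reward += success_bonus           # bonus on reaching the goal pose
    return reward, new_best
```

Because regressing never pays and stalling pays zero, the agent's only route to reward is steady motion toward each goal.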

What breaks without each step:

  • No procedural variety → overfit to one tool.
  • No goal-conditioned reward → aimless motions.
  • No object-centric inputs → vision gap hurts transfer.
  • No SAPG/critic/randomization → unstable learning and poor robustness.
  • No perception pipeline → no real-world goals or grasp hints.
  • No closed-loop → drift and drops.

Concrete mini-examples:

  • Data: Four keypoints on the tool define pose error; getting all four close means both position and orientation are right.
  • Reward: The agent only gets credit when it makes new progress toward the goal (not for hovering nearby).
  • Action: The policy sets desired joint positions; fingers typically move absolutely (to a shape), while the arm moves in smooth deltas.
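The four-keypoint pose error in the first mini-example can be sketched like this. A minimal version under stated assumptions: the keypoint layout, the (R, t) pose representation, and the function name are ours, not the paper's.

```python
import numpy as np

def keypoint_pose_error(keypoints_obj, pose_cur, pose_goal):
    """Mean distance between object keypoints under two rigid transforms.

    Each pose is a (R, t) pair (3x3 rotation, 3-vector translation);
    keypoints_obj are points fixed in the object frame. If every keypoint
    matches, both position and orientation agree.
    """
    R1, t1 = pose_cur
    R2, t2 = pose_goal
    p1 = keypoints_obj @ R1.T + t1   # keypoints under the current pose
    p2 = keypoints_obj @ R2.T + t2   # keypoints under the goal pose
    return float(np.mean(np.linalg.norm(p1 - p2, axis=1)))
```

Using keypoints folds position and orientation errors into one distance, which is convenient both for the reward and for deciding when a goal counts as reached.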

Secret Sauce:

  • The universal object-centric goal-reaching task acts like a “core workout” for dexterity, naturally inducing grasping, reorientation, and steady motion. Paired with robust training (SAPG, asymmetric critic, randomization) and simple real-world inputs (pose + handle box), it unlocks zero-shot transfer.

04Experiments & Results

The Test: Use DexToolBench—24 real-world tool-use trajectories across 12 tools and 6 categories (hammer, marker, eraser, brush, spatula, screwdriver). Measure Task Progress: the percent of goal poses matched within about 2 cm; like counting how many steps of a dance you hit exactly.
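The Task Progress metric as described, the percent of goal poses matched within about 2 cm, reduces to a simple count. A straightforward sketch; the benchmark's exact tolerance handling may differ.

```python
def task_progress(pose_errors_m, tol_m=0.02):
    """Percent of goal poses whose tracking error is within tolerance.

    pose_errors_m: one pose error (meters) per goal in the trajectory.
    The 2 cm default follows the description in the text.
    """
    hits = sum(1 for e in pose_errors_m if e <= tol_m)
    return 100.0 * hits / len(pose_errors_m)
```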

The Competition:

  • Kinematic Retargeting: Copy human hand motion kinematically from the video—no force reasoning.
  • Fixed Grasp: Hold the grasp stiff and try to track the object path only by moving the arm.
  • Specialist RL: Train a separate RL policy on a single object + single trajectory per category.

Scoreboard (with context):

  • Overall Zero-Shot: Across 120 real trials, SimToolReal generalized strongly to unseen tools and motions. It outperformed retargeting and fixed-grasp methods by 37% Task Progress, which is like getting a solid A when others get a C+. It also matched the performance of specialist RL policies on their home turf—even though those specialists trained on the exact object and motion.
  • Category Patterns: Highest scores on eraser tasks (mostly translations). Marker tasks were okay but thin shapes are trickier to grasp and easier to occlude. Brush, spatula, hammer, and screwdriver need in-hand rotations: performance stayed strong but dipped for thinner or heavier tools (e.g., the thin flat spatula and the heavy mallet).
  • Baseline Case Study (Brush Sweep): On an easy start (no rotation needed), fixed-grasp can follow somewhat but worse than SimToolReal. On a hard start (needs 90° rotation), fixed-grasp causes arm-table collisions; kinematic retargeting fails to grasp; SimToolReal succeeds via in-hand rotation.
  • Specialists vs Generalist (in sim): Specialists ace their specific training setup but drop sharply when the object or trajectory changes. SimToolReal stays consistently strong across all variants.

Surprising Findings:

  • Training Progress Predicts Generalization: As the policy gets better at random goal-reaching on procedurally generated tools, zero-shot performance on DexToolBench rises in lockstep—strong evidence that the universal training objective truly builds reusable dexterity.
  • Robust Recovery: When drops happen, the policy often re-grasps and tries again, provided tracking isn’t lost and the object is reachable.

Failure Insights (why things go wrong):

  • Pose Tracking Loss (~44% of failures): Small, dark, or symmetric tools (markers, small screwdrivers) plus hand occlusions can confuse the estimator.
  • Object Drops (~35%): Heavier tools or contact with the environment can shake loose the grip.
  • Incomplete In-Hand Rotation (~18%): Very thin tools resist stable reorientation.
  • Initial Grasp Failures (~4%): Mostly thin objects rolling off the table.

Ablations (what mattered most):

  • SAPG over PPO: SAPG’s diverse exploration unlocked much higher rewards and better skills.
  • Asymmetric Critic: Removing privileged info for the critic severely hurt learning under partial observability.

Big Picture: The method isn’t just competitive; it’s robust and flexible, succeeding where motion retargeting and fixed-grasp strategies often fail.

05Discussion & Limitations

Limitations (be specific):

  • Function vs Motion: Tracking tool poses doesn’t guarantee task completion when high forces are needed (e.g., actually driving a nail fully).
  • Environment-Blind Goals: Only conditioning on object pose can lead to collisions in clutter or with the table during aggressive moves.
  • Rigid-Tool Assumption: Flexible or articulated tools (e.g., scissors) aren’t modeled; pose alone may not capture their state.
  • Fixed High-Level Plan: The goal sequence is not replanned online when conditions change (e.g., tracking jitters or object shifts).

Required Resources:

  • A dexterous hand + arm (29 DoF in this work), GPU-based simulator (e.g., Isaac Gym), and lots of parallel environments.
  • Vision stack: RGB-D camera and foundation models (SAM 3D, FoundationPose) for mesh, handle box, and pose tracking.
  • Compute for RL (SAPG), plus calibration to match sim and real dynamics.

When NOT to Use:

  • Tasks where force outcomes matter more than pose (e.g., torquing screws to a target torque, heavy-duty hammering).
  • Highly cluttered scenes needing environment-aware planning.
  • Non-rigid or articulated tools where pose isn’t enough.
  • Settings with poor RGB-D sensing or severe occlusions beyond what tracking can handle.

Open Questions:

  • Can we fuse environment awareness (e.g., obstacles, target surfaces) into the goal conditioning to avoid collisions automatically?
  • How can we incorporate force/torque targets or tactile feedback to guarantee functional task success?
  • Can we extend beyond rigid tools to deformable or articulated ones with richer state representations?
  • Could multi-view or wrist-mounted cameras stabilize pose tracking for thin or symmetric tools?
  • How far can large-scale procedural training push zero-shot dexterity—what are the scaling laws?

06Conclusion & Future Work

3-Sentence Summary: SimToolReal reframes dexterous tool use as moving a handheld object through a sequence of desired poses, trained once in simulation across many procedurally generated tools. By using simple, robust inputs (6D pose + handle box), a strong RL recipe (SAPG, asymmetric critic, randomization), and trajectory goals from a single human RGB-D video, the same policy runs zero-shot on real tools. It beats retargeting and fixed-grasp baselines and matches specialists while generalizing widely.

Main Achievement: A single, object-centric, goal-conditioned policy that zero-shot manipulates many unseen real tools and tasks without per-task engineering or additional training.

Future Directions: Add environment-aware conditioning to avoid collisions; integrate force or tactile objectives for guaranteed functional success; support deformable/articulated tools; improve pose tracking with multi-view sensing; scale up procedural diversity and analyze generalization laws.

Why Remember This: It shows a simple, universal objective—reach the next object pose—can unlock broad dexterous skills, transforming tool use from a collection of one-off tricks into a general, reusable capability.

Practical Applications

  • Teach a household robot to wipe a whiteboard or tabletop by recording one short human demo video.
  • Have a workshop assistant grasp any available brush or scraper and follow a cleaning trajectory without reprogramming.
  • Enable a kitchen robot to flip or serve with new spatulas it has never seen before using pose trajectories.
  • Assistive robots can write simple notes with any pen or marker by tracking letter-shaped pose sequences.
  • Rapidly set up light hammering or tapping tasks (e.g., alignment knocks) by extracting swing paths from a demo.
  • Prototype screwdriver maneuvers like positioning and initial spinning using a single RGB-D demo, zero-shot.
  • Scale testing of dexterous hands across many tool shapes in simulation before buying real tools.
  • Use the same general policy across multiple sites with different camera setups by relying on pose and handle boxes.
  • Recover from small mistakes (drops or drift) during tasks thanks to closed-loop, goal-by-goal tracking.
  • Extend lab demos to real deployments quickly by skipping per-tool reward tuning and object modeling.
#dexterous-manipulation #sim-to-real-reinforcement-learning #goal-conditioned-policy #object-centric-control #tool-manipulation #pose-tracking #procedural-generation #domain-randomization #asymmetric-critic #SAPG #SAM-3D #FoundationPose #DexToolBench #zero-shot-transfer #in-hand-reorientation