
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

Intermediate
Runpei Dong, Ziyan Li, Xialin He et al. · 2/18/2026
arXiv

Key Summary

  • This paper teaches a humanoid robot to find and pick up many different objects in new places using plain-language requests like 'grab the orange mug.'
  • The key is a modular system: big vision models choose what and where to grasp, and a super-accurate hand (end-effector) tracker moves the robot’s body to the right spot.
  • They fix a big hidden problem: on affordable humanoids, textbook forward kinematics and base odometry are inaccurate, so they learn small corrections (residual neural models) to make them precise.
  • The tracker also plans motions with inverse kinematics, re-plans during execution, and nudges the goal (goal adjustment) so the hand ends up exactly where it should.
  • In real tests, the robot averaged 2.44 cm hand-position error—about 3.2× better than strong baselines—and could grasp objects across tables from 43 cm to 92 cm tall.
  • The system handled open-vocabulary commands, succeeding 90% on 10 everyday objects, 73.3% across 10 new scenes, and 80% in clutter.
  • This shows a practical path beyond huge real-world imitation datasets: combine general vision with accurate, sim-trained control.
  • The approach uses onboard cameras only, works in cafés, offices, and labs, and coordinates full-body bending, squatting, and twisting while staying balanced.
  • Failures mostly came from the simple Dex-3 hand (objects slipping or tipping), not from the controller or perception core.
  • The work opens a bridge to reuse many manipulation skills on versatile humanoids by nailing precise end-effector control.

Why This Research Matters

With precise open-vocabulary grasping, a humanoid can help in everyday places without needing special retraining for each object. It can fetch items at home, restock shelves in a café, or assist in offices and classrooms, acting on natural-language requests. Hospitals and eldercare settings could benefit from reliable, careful object handling across different rooms and layouts. Because the system is modular, upgrades to vision or control instantly improve the whole robot without re-collecting massive datasets. The learned residuals mean affordable hardware can still achieve high precision, making practical robots more accessible. This work points to a future where robots quickly adapt to our words, our spaces, and our stuff.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how you can close your eyes and still reach the spoon on the table because your body has a great sense of where your hand is? Robots want to do that too—reach, balance, and grab things they’ve never seen before—even in rooms they’ve never visited.

🥬 Filling (The Actual Concept - The World Before): Before this research, humanoid robots were good at cool stunts (like balancing or even doing flips) and okay at copying motions humans demonstrated. But picking up new objects, in new places, based on camera views and a natural-language request (like “grab the green apple”) was still very hard. Why? Two big obstacles stood in the way:

  1. Seeing and deciding: The robot needs to understand what the word means (like 'orange mug') and find it in a messy picture from its head camera.
  2. Precise hand control: The robot’s hand (the end-effector) must move to the exact 3D grasp spot while the whole body bends, twists, and stays stable.

🥬 Filling (Failed Attempts): A popular approach tried to learn everything end-to-end from real-world human demonstrations. But collecting enough diverse, high-quality demos is extremely hard and time-consuming. These all-in-one models often struggle when the robot walks into a new room or sees a new object. Another problem: even if you give a robot the right 3D target, many humanoid controllers miss by 8–13 cm—far too sloppy to grab a soda can.

🥬 Filling (The Gap): The field needed a way to mix the best of both worlds: use powerful, pre-trained vision models that already recognize lots of objects and words (so you don’t need to collect new data), and pair them with a control system that can place the hand exactly where the grasp should be, even while the robot’s whole body moves.

🥬 Filling (Why It Matters): If a robot can be told, “pick up the red book near the keyboard” and do it safely and precisely, it becomes useful in offices, cafés, classrooms, and homes—without re-training for each new item or place. It could help clean up, bring tools, fetch groceries, or assist someone with mobility challenges. This is not just about being clever; it’s about being truly helpful.

🍞 Bottom Bread (Anchor): Imagine a robot in a coffee shop. You say, “Grab the orange mug.” It looks around with its camera, finds the mug even if it’s never seen that brand or color pattern before, plans a safe path, bends at the waist, reaches forward, and grasps it gently—without knocking it over. That’s the world this paper pushes us toward.

Now, as we walk through the paper’s story, we’ll introduce each new idea using the Sandwich pattern so the whole recipe makes sense end-to-end.

— New Concept Sandwich — 🍞 Hook: You know how your fingertips have to be exactly on the pencil to pick it up? 🥬 End-Effector Control: This is how a robot moves its hand (or tool) to a precise spot and orientation.

  • How it works: (1) Set a 3D goal pose for the hand, (2) compute body and arm motions to reach it, (3) keep balancing while moving, (4) adjust if drifting.
  • Why it matters: Without precise end-effector control, the robot can’t grasp reliably; it will miss, bump, or slip. 🍞 Anchor: To pick up a USB stick, the robot’s hand must be within a couple of centimeters and properly oriented—end-effector control makes that possible.

— New Concept Sandwich — 🍞 Hook: Imagine asking a friend, “Pass me the piranha plant toy,” and they find it even if it’s a toy they’ve never seen. 🥬 Open-Vocabulary Understanding: The robot can interpret many everyday words (like 'olive oil bottle' or 'helicopter') it was not specially trained on.

  • How it works: (1) Convert text to a concept, (2) detect/segment that object in the camera image, (3) pass its 3D location to the controller.
  • Why it matters: Without it, the robot only handles a small, pre-labeled list of objects and gets confused by new ones. 🍞 Anchor: When told 'grab the green apple', the robot picks the green one, not the red one, even if both are present.

— New Concept Sandwich — 🍞 Hook: Like LEGO bricks, you can swap pieces or fix one without rebuilding the whole castle. 🥬 Modular System Design: The robot’s brain is split into parts: a vision/planning part and a movement/control part.

  • How it works: (1) Vision module finds the object and a good grasp, (2) control module moves the body and hand precisely to that grasp, (3) they communicate but stay independent.
  • Why it matters: If vision gets better, you swap it in; if control gets better, you swap that in—no need to relearn everything. 🍞 Anchor: The system uses large vision models for recognition and a separate, super-precise hand tracker for motion—plug-and-play upgrades.

02Core Idea

🍞 Hook: Imagine building a search-and-grab superpower for a robot: your words tell it what to fetch; its eyes find the thing; and its body places the hand exactly where it needs to go—even while squatting or twisting.

🥬 The Aha! Moment (One sentence): Combine open-vocabulary vision that can find almost any object with a residual-aware, motion-planned, re-planning end-effector tracker that lands the robot’s hand within a few centimeters—reliably.

— Multiple Analogies —

  1. Orchestra: Vision is the sheet music (what to play), the end-effector tracker is the conductor (how to play precisely), and the body joints are the instruments (moving in harmony). Before, the instruments played roughly; now the conductor gets them perfectly in sync.
  2. GPS + Parking Assist: Vision is the GPS that picks the destination (the object and grasp), while the tracker is the parking assist that precisely lines up into the tight spot without scraping anything.
  3. Sports Drill: Vision says “catch that ball over there,” while the tracker teaches the body the exact footwork and hand placement so the catch sticks—not just close, but precise.

— Before vs After —

  • Before: End-to-end methods needed massive demos and still struggled with new objects or rooms. Controllers that looked smooth in joint space still missed the grasp by 8–13 cm.
  • After: With large vision models and a residual-aware tracker, the robot lands its hand within ~2.5 cm, even while bending, squatting, or twisting—enough to grasp reliably in many real scenes.

— Why It Works (Intuition, not equations) —

  • If your measuring tape (forward kinematics) is off by 1–2 cm all the time, you can learn a small correction (residual) that fixes it consistently.
  • If your body position drifts as you move, learning base odometry corrections lets you keep your map updated.
  • If your plan gets stale as you move, re-planning refreshes the path.
  • If your hand keeps stopping a bit short, nudge the goal a tiny bit farther (goal adjustment). Together, these tiny, smart fixes turn a decent motion into a precise grasp.
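The residual idea in the first bullet fits in a few lines: a measurement with a consistent bias is easy to fix once you estimate that bias from calibration data. This toy sketch uses a constant offset and NumPy; the paper's residual models are small neural nets trained against motion-capture ground truth.

```python
import numpy as np

# Toy illustration of the residual idea (not the paper's model):
# if a sensor has a systematic bias, a small learned correction
# recovers the true value almost exactly.

rng = np.random.default_rng(0)
true_positions = rng.uniform(0.0, 1.0, size=100)
bias = 0.017  # a consistent ~1.7 cm offset, like the FK error described
measured = true_positions + bias + rng.normal(0.0, 0.001, size=100)

# "Training": estimate the residual from calibration pairs
learned_residual = np.mean(measured - true_positions)

# "Deployment": subtract the learned residual from new measurements
corrected = measured - learned_residual
print(np.abs(corrected - true_positions).mean())  # far below the raw bias
```

Because the error is systematic rather than random, even this one-parameter "model" removes almost all of it—which is the intuition behind learning residuals instead of collecting more demonstrations.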

— Building Blocks (Each as Sandwiches) — 🍞 Hook: You know how you figure out how to bend your arm to reach a shelf? 🥬 Inverse Kinematics (IK): IK computes the joint angles that place the hand at a desired 3D pose.

  • How it works: (1) Start from the goal hand pose, (2) solve for shoulder/arm/waist angles that reach it, (3) respect joint limits and collisions.
  • Why it matters: Without IK, you can’t turn a hand goal into a feasible body posture. 🍞 Anchor: To reach a low coffee table, IK adds some knee bend and waist tilt to let the hand get down safely.
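A minimal worked example of the IK idea, using an analytic two-link planar arm rather than the paper's whole-body solver; the function names and link lengths are invented for illustration.

```python
import math

def two_link_ik(x, y, l1=0.3, l2=0.25):
    """Analytic IK for a planar 2-link arm: joint angles that put the
    'hand' at (x, y). A toy stand-in for whole-body IK."""
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

def two_link_fk(shoulder, elbow, l1=0.3, l2=0.25):
    """Forward check: where does this joint configuration put the hand?"""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y

# Round-trip check: IK then FK recovers the target
s, e = two_link_ik(0.4, 0.2)
print(two_link_fk(s, e))  # ≈ (0.4, 0.2)
```

On the real robot the solver must also handle many more joints, joint limits, and balance, but the core question is the same: which joint angles land the hand on the goal pose?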

🍞 Hook: Planning a route before you start walking avoids bumping into furniture. 🥬 Motion Planning: It generates a smooth, collision-free joint trajectory from where you are to the IK goal.

  • How it works: (1) Takes depth from the camera and the robot’s geometry, (2) finds a safe path in joint space, (3) produces a reference trajectory to track.
  • Why it matters: Without it, the arm may sweep into objects or twist into awkward, unstable shapes. 🍞 Anchor: When the mug sits behind a box, the planner arcs the wrist around the box instead of smashing into it.

🍞 Hook: If a ruler is bent, your measurements are always a bit off. 🥬 Forward Kinematics (Analytical FK): A mathematical method to compute where the hand is from joint angles and robot geometry.

  • How it works: (1) Uses link lengths and angles to sum up transformations, (2) outputs the hand pose in the base frame.
  • Why it matters: If FK is inaccurate (due to elasticity or sensors), the tracker thinks the hand is somewhere it isn’t and misses the grasp. 🍞 Anchor: On a low-cost humanoid, FK was off by ~1.76 cm—enough to miss small objects.
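Under the hood, FK is just a chain of rigid transforms summed link by link. A planar sketch with illustrative link lengths (not the real robot's geometry):

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def planar_fk(joint_angles, link_lengths):
    """Chain 2-D transforms link by link; returns the hand position.
    A minimal stand-in for full 3-D FK over a robot's kinematic tree."""
    pos = np.zeros(3)
    total = 0.0
    for theta, length in zip(joint_angles, link_lengths):
        total += theta  # each joint rotates everything downstream
        pos = pos + rot_z(total) @ np.array([length, 0.0, 0.0])
    return pos

# A straight arm with links of 0.3 m and 0.25 m reaches 0.55 m forward
print(planar_fk([0.0, 0.0], [0.3, 0.25]))
```

The catch described above: the math assumes rigid links and exact joint readings, so real elasticity and sensor error make the computed hand pose drift from the true one.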

🍞 Hook: When your watch runs 2 minutes slow every day, you add 2 minutes when you check time. 🥬 Residual Neural Forward Models: Small learned correction models that fix systematic FK and odometry errors.

  • How it works: (1) Run analytical FK/odometry, (2) add a learned residual from a neural net trained on motion-capture truth, (3) get a much more accurate hand/base pose.
  • Why it matters: Without residuals, tiny constant errors stack up and ruin precise grasps. 🍞 Anchor: Correcting FK with residuals brought hand pose error down to ~0.27 cm in tests.
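A toy version of the residual pattern: run the analytical model, then add a small correction fitted on calibration data. Here the "residual model" is a linear least-squares fit on synthetic 1-D data; the paper instead trains a neural net against motion-capture ground truth.

```python
import numpy as np

def analytical_fk(q):
    # Toy 1-D "FK": hand position along a line for joint value q
    return 0.5 * np.cos(q)

def true_fk(q):
    # The real hardware flexes slightly, adding a small systematic error
    return 0.5 * np.cos(q) + 0.01 * q + 0.005

# Fit a tiny linear "residual model" from calibration pairs
q_cal = np.linspace(-1.0, 1.0, 50)
residual_targets = true_fk(q_cal) - analytical_fk(q_cal)
A = np.stack([q_cal, np.ones_like(q_cal)], axis=1)
coef, _, _, _ = np.linalg.lstsq(A, residual_targets, rcond=None)

def corrected_fk(q):
    # Analytical prediction plus the learned residual
    return analytical_fk(q) + (coef[0] * q + coef[1])

q_test = 0.3
print(abs(analytical_fk(q_test) - true_fk(q_test)))  # ~0.008 raw error
print(abs(corrected_fk(q_test) - true_fk(q_test)))   # near zero after residual
```

The design choice mirrors the paper's: keep the analytical model as the backbone and learn only the small, systematic part it gets wrong, which needs far less data than learning the whole mapping from scratch.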

🍞 Hook: If you turn your torso, your reference changes—even if your feet didn’t move. 🥬 Neural Odometry (Residual): Learns how the robot base moves (relative to the feet) during whole-body motions.

  • How it works: (1) Use leg joint states to compute base motion, (2) add a learned residual to remove drift, (3) keep the map aligned.
  • Why it matters: Without it, grasp targets set earlier won’t match where the robot really is now. 🍞 Anchor: As the robot squats and reaches, the learned odometry keeps the hand target lined up with the real table spot.

🍞 Hook: If traffic changes, you don’t stick to last hour’s directions—you re-check the route. 🥬 Closed-Loop Replanning: Periodically re-compute the reference trajectory while moving.

  • How it works: (1) Every few seconds, re-run the planner with current state, (2) shorten errors, (3) refresh to avoid stale plans.
  • Why it matters: Without it, small drifts grow and the final grasp can miss. 🍞 Anchor: Replanning every ~6 seconds reduced end-effector error significantly in real tests.

🍞 Hook: If your throws always land short by a tiny bit, aim a little farther. 🥬 Goal Adjustment: Slightly scale the current hand error to push the target so the final hand ends up right on spot.

  • How it works: (1) When close, multiply the position error by a small factor (~1.6), (2) don’t overdo rotation, (3) stop when within ~2 cm.
  • Why it matters: Without it, the hand may consistently stop a little short due to sim-to-real mismatch. 🍞 Anchor: This nudge helped close the last centimeters needed for a solid grasp.
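The rule above can be sketched directly. The thresholds (~15 cm start, ×1.6 scale, ~2 cm stop) follow the description here; the function name and example vectors are illustrative.

```python
import numpy as np

def adjust_goal(goal, hand, scale=1.6, start_radius=0.15, stop_radius=0.02):
    """Toy goal adjustment: once the hand is within start_radius of the
    goal, command a target pushed past the goal by `scale` times the
    remaining error; stop adjusting once within stop_radius."""
    error = goal - hand
    dist = np.linalg.norm(error)
    if dist < stop_radius or dist > start_radius:
        return goal                  # no adjustment needed
    return hand + scale * error      # overshoot the goal slightly

goal = np.array([0.50, 0.20, 0.90])
hand = np.array([0.45, 0.20, 0.90])  # 5 cm short of the goal
print(adjust_goal(goal, hand))       # commanded target lies past the goal
```

Because the shortfall is a consistent bias, aiming slightly past the goal cancels it, and the stop radius prevents the nudge from turning into oscillation.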

🍞 Hook: Practice in a gym makes you better for the real game. 🥬 Sim2Real Training: Train the controller in fast simulation, then transfer to the real robot, using randomization to handle differences.

  • How it works: (1) Simulate physics with many varied settings, (2) learn a robust policy with PPO, (3) deploy on real hardware.
  • Why it matters: Without it, you’d need massive real demos; with it, you learn skills safely and faster. 🍞 Anchor: The tracker was trained in Isaac Gym then used on a Unitree G1 with close performance to motion-capture truth.

🍞 Hook: Learning to play a game by trying and improving your moves each round. 🥬 PPO (Reinforcement Learning): A popular method to train policies by trial, reward, and improvement steps without exploding updates.

  • How it works: (1) Roll out many episodes, (2) compute advantages, (3) update policy with a clipped objective.
  • Why it matters: Without stable RL, learning consistent whole-body precision is difficult. 🍞 Anchor: PPO trained the whole-body tracker on thousands of reach goals plus human motion priors from AMASS.
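The clipped objective at the heart of PPO fits in a few lines. This shows only the per-sample surrogate, not the full training loop:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate (per sample): take the minimum of the
    unclipped and clipped policy-ratio terms so updates stay bounded."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A big ratio with positive advantage gets clipped (no runaway update) ...
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # → [1.2]
# ... while a ratio inside the trust region passes through unchanged.
print(ppo_clip_objective(np.array([1.1]), np.array([1.0])))  # → [1.1]
```

Taking the minimum makes the update pessimistic: the policy can never gain much from moving far outside the clip region in one step, which is what keeps whole-body training stable.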

03Methodology

At a high level: Language query + RGB-D → Open-vocabulary perception (detect + segment) → 6-DoF grasp generation → Retarget to Dex-3 hand → IK + motion planning → Residual-aware end-effector tracking (with neural FK + neural odometry + goal adjustment + periodic replanning) → Grasp and lift.
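The pipeline's modularity can be sketched as a chain of stage functions. Every name below is a placeholder stub; each real stage is a large model or planner (Grounding DINO + SAM, AnyGrasp, cuRobo, the learned tracker).

```python
# Hypothetical stage names for illustration only: the point is that each
# stage consumes the previous stage's output and can be swapped out
# independently, which is the "LEGO brick" modularity described above.

def detect_and_segment(query, rgbd):
    return {"mask": f"mask({query})", "depth": rgbd["depth"]}

def propose_grasp(segmented):
    return {"pose_6dof": f"grasp_on({segmented['mask']})"}

def retarget_to_hand(grasp):
    return {"ee_target": f"dex3({grasp['pose_6dof']})"}

def plan_and_track(target):
    return f"tracked({target['ee_target']})"

def hero_pipeline(query, rgbd):
    segmented = detect_and_segment(query, rgbd)
    grasp = propose_grasp(segmented)
    target = retarget_to_hand(grasp)
    return plan_and_track(target)

print(hero_pipeline("orange mug", {"rgb": None, "depth": None}))
```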

Step-by-step, like a recipe:

  1. Input: Free-form language + head camera images (RGB-D). Example: “Pick up the orange mug.”
  • What happens: The system uses large vision models to find that specific object in the camera view and segment it out from the background.
  • Why this step exists: Without knowing exactly which pixels are the object, the robot can’t pick a safe, stable grasp.
  • Example: It places a tight mask around the orange mug even if a red mug is nearby.

— Sandwich: Large Vision Models (Grounding DINO + SAM) — 🍞 Hook: Like a friend who can quickly spot your backpack in a crowded room. 🥬 Large Vision Models: Pre-trained detectors/segmenters that can identify many objects from text prompts and delineate them in images.

  • How it works: (1) Grounding DINO finds the target box from text, (2) SAM refines it into a detailed mask.
  • Why it matters: Without strong, general vision, you’d need per-object training; with LVMs, you handle a wide range of objects. 🍞 Anchor: The robot hears “toy dog” and segments the plush dog even if it’s a brand-new toy.
  2. Grasp Pose Proposals: Generate 6-DoF grasps for the segmented object using AnyGrasp; filter to pick stable options (e.g., jaws parallel to the table).
  • What happens: From the object’s shape in 3D, the system proposes many possible gripper poses and ranks them by confidence and feasibility.
  • Why this step exists: A grasp point must be both reachable and stable; bad choices lead to slips or knocks.
  • Example: The system filters out grasps coming from behind the object (on the far side) if the robot’s hand can’t safely get there.

— Sandwich: Grasp Pose Generation (AnyGrasp) — 🍞 Hook: When you pick up a box, you choose a side and angle that feels steady. 🥬 AnyGrasp: A model that proposes many possible grasp poses around an object and scores how good they are.

  • How it works: (1) Analyze object geometry, (2) predict candidate grasps, (3) rank and filter with task constraints.
  • Why it matters: Without reliable grasp proposals, the robot might grab thin air or unstable edges. 🍞 Anchor: For the olive oil bottle, AnyGrasp suggests side grasps with good contact, skipping the slippery cap.
  3. Retarget to Dex-3 Hand: Convert a parallel-jaw grasp into the Dex-3 hand configuration by rotating ~45° so the thumb becomes one jaw and the other two fingers the opposite jaw; clip extreme yaw to avoid awkward full-body postures.
  • What happens: The selected grasp pose is mapped into a 6-DoF end-effector target that matches the real hand’s finger layout and limits.
  • Why this step exists: The Dex-3 hand isn’t a simple straight gripper; retargeting boosts contact area and tolerates small errors.
  • Example: That rotation lets the hand squeeze the mug like two jaws rather than poking with fingertips.

— Sandwich: Dex-3 Hand Retargeting — 🍞 Hook: A pianist switches fingerings when moving from piano to keyboard. 🥬 Retargeting: Translating an abstract gripper pose into the exact wrist pose and finger arrangement for the Dex-3.

  • How it works: (1) Rotate the grasp by 45° around z, (2) arrange thumb vs two fingers as jaws, (3) clip angles to keep motion natural.
  • Why it matters: Without retargeting, the hand won’t make stable contact and grasps will fail. 🍞 Anchor: The hand retargeting helps hold a game cartridge without it slipping out.
  4. IK + Motion Planning: Use inverse kinematics to find a feasible upper-body goal and cuRobo to generate a collision-free trajectory from the current pose to that goal.
  • What happens: The planner uses the depth map to avoid obstacles and produce a smooth path.
  • Why this step exists: Getting there safely and stably matters as much as the final grasp.
  • Example: If the table is high, IK bends at the waist while keeping balance; the planner avoids brushing the mug before the final approach.

— Sandwich: Motion Planning (cuRobo) — 🍞 Hook: Like planning a path through a crowded hallway without bumping people. 🥬 Motion Planning: Compute a safe, smooth sequence of joint positions that lead to the target.

  • How it works: (1) Consider joint limits and collisions, (2) optimize a path, (3) output a time-parameterized trajectory.
  • Why it matters: Without it, arms collide with objects or twist into unstable shapes. 🍞 Anchor: For a cluttered desk, the plan arcs the elbow around a book stack.
  5. Residual-Aware EE Tracking: A learned policy follows the reference trajectory but also uses accurate end-effector and base estimates from residual neural models—plus goal adjustment and replanning—to stay precise.
  • What happens: The policy outputs 29-DoF joint targets; PD controllers turn those into torques; periodic replanning refreshes the path.
  • Why this step exists: Real robots have elastic joints and small sensor errors; the residuals, adjustments, and replans keep the hand on target.
  • Example: If the hand keeps being 1 cm short, goal adjustment nudges it; if the torso moved more than expected, odometry residual fixes the base estimate.

— Sandwich: Residual Neural FK — 🍞 Hook: If your map is always off by one block east, you correct for it. 🥬 Residual FK: A neural net learns a small transform to add to analytical FK to get an accurate hand pose.

  • How it works: (1) Compute FK from joint angles, (2) predict a residual translation + rotation, (3) compose them to get a precise pose.
  • Why it matters: Without this, the controller thinks the hand is somewhere else and misses the grasp. 🍞 Anchor: In tests, residual FK reduced hand pose error from ~1.76 cm to ~0.27 cm.

— Sandwich: Neural Odometry (Residual) — 🍞 Hook: When you lean forward, your center shifts; your sense of where you stand must update. 🥬 Residual Odometry: Learns base motion from leg joints assuming feet are planted, then corrects FK odometry with a residual.

  • How it works: (1) Compute base pose change from legs, (2) add neural residual, (3) keep the world-aligned correctly.
  • Why it matters: Without it, the target drifts as you move and the final reach won’t match the real world. 🍞 Anchor: While squatting to a short table, residual odometry keeps the hand target aligned with the intended point.

— Sandwich: Closed-Loop Replanning — 🍞 Hook: You re-check directions mid-trip if roads change. 🥬 Replanning: Every few seconds, update the reference trajectory given the current state.

  • How it works: (1) Re-run planner quickly (~20 ms), (2) shorten error, (3) keep the path feasible.
  • Why it matters: Without replanning, you can get stuck following an outdated plan. 🍞 Anchor: Replanning every ~6 seconds lowered tracking error notably.
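A toy simulation of why replanning helps: with a fixed per-tick drift standing in for model error, an open-loop plan accumulates the full drift, while refreshing the plan from the current state bounds the miss to whatever drift accrued since the last replan. All numbers are invented for illustration.

```python
import numpy as np

drift = np.array([0.001, 0.0005])   # per-tick model error (toy numbers)
goal = np.array([0.5, 0.2])
T = 100

# Open loop: plan once, then execute the planned steps blindly
hand = np.zeros(2)
step = goal / T
for t in range(T):
    hand = hand + step + drift
open_loop_error = np.linalg.norm(hand - goal)

# Closed loop: every 20 ticks, replan the remaining path from the
# current state (as the system replans every ~6 seconds)
hand = np.zeros(2)
for t in range(T):
    if t % 20 == 0:
        step = (goal - hand) / (T - t)   # fresh plan from where we are
    hand = hand + step + drift
closed_loop_error = np.linalg.norm(hand - goal)

print(open_loop_error, closed_loop_error)  # replanning shrinks the miss
```

In this sketch the open-loop run ends a full T ticks of drift away from the goal, while the closed-loop run only carries the drift from its final 20-tick segment.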

— Sandwich: Goal Adjustment — 🍞 Hook: If you always stop one step shy of the door, aim one step past it. 🥬 Goal Adjustment: Slightly scale the positional error (not rotation) when close, to overcome persistent shortfall.

  • How it works: (1) Start when error < 15 cm, (2) scale by ~1.6, (3) stop once within ~2 cm.
  • Why it matters: Without it, small biases remain and grasps fail by a hair. 🍞 Anchor: This final nudge helped the hand close exactly on a small object like a USB stick.

— Sandwich: Training via PPO — 🍞 Hook: Practicing a drill again and again until you can do it with your eyes closed. 🥬 PPO Training: The policy learns in simulation using many reaching targets and human-motion priors, with domain randomization.

  • How it works: (1) Sample reach goals and trajectories, (2) reward accurate end-effector + stable motion, (3) randomize dynamics to prepare for reality.
  • Why it matters: Without robust training, tiny real-world differences would derail precision. 🍞 Anchor: After sim training, the policy was deployed on the Unitree G1 and hit real targets within ~2.44 cm.

Secret Sauce (What’s clever):

  • Residual corrections for both hand pose (FK) and base odometry fix systematic hardware errors.
  • Periodic replanning prevents drift from accumulating.
  • Goal adjustment cleanly removes the last centimeter-scale bias.
  • All of this is wrapped around a planner-guided, sim-trained policy—modular, accurate, and generalizable.

04Experiments & Results

The Tests (What they measured and why):

  • End-to-end open-vocabulary grasping: Can the robot, using only onboard sensors, find the named object and grasp it across different table heights, rooms, and clutter levels?
  • Tracking accuracy: How close does the hand get to the intended 3D target? Small numbers here mean reliable grasps.
  • Component ablations: How much do residual FK/odometry, replanning, and goal adjustment help? If you remove each, does performance drop?

The Competition (Baselines):

  • AMO and FALCON—two strong recent whole-body trackers known for impressive tracking in joint space but not necessarily for precise hand placement.
  • Analytical FK / odometry without learned residuals.

Scoreboard with Context:

  1. End-to-End Grasping on Everyday Objects (Real World)
  • 10 daily objects on standard (0.74 m) and short (0.56 m) tables: 90% success.
  • Meaning: This is like scoring 9 out of 10 on a tough test—strong proof the system really works outside the lab’s comfort zone.
  2. Scene Generalization (10 Novel Scenes, 10 New Objects)
  • Success: 22/30 (73.3%) across offices, cafés, lounges, labs, and classrooms.
  • Meaning: In new rooms with different lighting and layouts, the system still picks correct items most of the time—vision generalization is doing its job.
  3. Cluttered Scenes (5 Layouts)
  • Success: 12/15 (80%).
  • Meaning: Even when objects are near distractors, language + segmentation keeps the robot on the right target.
  4. Tracking Accuracy vs State of the Art (Simulation)
  • Average translation error: HERO ~2.48 cm; AMO ~8.29 cm; FALCON ~13.57 cm.
  • Meaning: HERO’s hand lands about three times closer than the best baseline—like parking cleanly between lines vs. parking across two spots.
  • Note: HERO’s joint tracking error was larger, but end-effector error was much smaller. This shows that optimizing joints alone doesn’t guarantee good hand placement—the task-space view wins.
  5. Real-World Tracking with MOCAP Truth
  • Mean hand error ~2.44 cm (with motion capture measurements).
  • Using learned residual FK/odometry brought performance close to the oracle (motion capture) case.
  • Meaning: The learned residuals effectively replace expensive external tracking—your onboard estimate gets nearly as good.
  6. Ablations (What mattered):
  • Residual FK and odometry: Replacing either with analytical versions increased errors; using both learned residuals lowered error to near-oracle.
  • Replanning: Removing it significantly increased error over time—periodic updates are crucial.
  • Goal adjustment: Smaller but consistent gains; it smooths out the last centimeters.

Surprising Findings:

  • Analytical FK on a low-cost humanoid is off by ~1.76 cm in translation: small, but more than enough to break grasps. Because the error is systematic, a small neural residual fixes it dramatically.
  • Joint-space-perfect trackers can still miss grasps because end-effector placement is what manipulation truly needs. HERO explicitly optimizes for that.
  • With learned residuals and replanning, onboard estimates approach MOCAP accuracy—making real-world deployments practical without external motion-capture systems.

Concrete Examples:

  • Orange mug on a high table: The system squats slightly and bends the waist to align the hand. Learned odometry keeps the target steady even as the torso shifts.
  • Cluttered desk with books: Planning arcs the wrist around books; residual FK ensures the final pinch aligns cleanly on the selected jaw grasp.
  • Toy dog in a lounge: Open-vocabulary detection locks onto the plush dog; the hand target is slightly nudged by goal adjustment to ensure a firm grip.

05Discussion & Limitations

Limitations (Specific):

  • Narrow camera field of view: When the torso twists or the target is far (>~0.6–1 m), the object may leave view. This hinders purely visual closed-loop correction during big whole-body motions.
  • Planner-dependence: Using a classical planner can yield awkward or energy-inefficient poses; integrating learned priors for more natural, efficient motions is a next step.
  • Dexterity constraints: The Dex-3 hand is simple; large or thin, unstable objects can slip or tip. Failures often reflect the gripper, not the controller.
  • Modular brittleness: In very complex scenes, vision modules (detection/segmentation) can make mistakes, which cascade to grasp failure.

Required Resources:

  • A humanoid with head RGB-D camera, IMU, proprioception; compute capable of running detection/segmentation and planning (e.g., laptop with a mid-range GPU).
  • Motion-capture only for offline calibration/validation if desired; learned residuals reduce dependence on external systems.
  • Simulation infrastructure (Isaac Gym or similar) for training and domain randomization.

When NOT to Use:

  • Tasks requiring fine finger-level dexterity (e.g., threading a needle) with a simple hand.
  • Very long-range searching or high-speed moving targets, given the limited FOV and camera placement.
  • Highly deformable, transparent, or reflective items where current grasp generators struggle.

Open Questions:

  • Active vision: How much would a moving head/eyes (neck DoFs, gaze control) improve closed-loop performance when the torso twists?
  • Energy-efficient planning: Can learned trajectory priors make whole-body reaches look more human-like and waste less energy?
  • Better hands: How does performance scale with more dexterous grippers? Does the same control stack unlock complex in-hand manipulation?
  • Multi-step tasks: How to chain grasps (open a door, pick a bottle inside, pour) using the same modularity and residual-precision ideas?

06Conclusion & Future Work

Three-Sentence Summary: This paper presents HERO, a modular humanoid system that combines open-vocabulary vision and a residual-aware end-effector tracker to precisely grasp novel objects in novel scenes. By learning small corrections to forward kinematics and base odometry, and by using motion planning, periodic replanning, and goal adjustment, the robot’s hand reliably lands within a few centimeters of target. The result is robust, general, and practical open-world grasping—90% success on everyday objects and strong generalization to new rooms and clutter.

Main Achievement: Turning precise end-effector control into a dependable building block for open-vocabulary humanoid manipulation by fixing the hidden, systematic geometry errors that previously sabotaged accuracy.

Future Directions:

  • Add active vision (neck/eye control) for continuous visual feedback during big whole-body reaches.
  • Blend learned trajectory priors with planning for more natural, energy-efficient motion.
  • Upgrade grippers to unlock dexterous in-hand actions and more challenging tasks (like door unlatching without help).
  • Chain tasks to complete multi-step household skills.

Why Remember This: It shows a scalable recipe: let big vision models decide what and where, and let a precise, residual-aware tracker handle how to get there—proving that small, smart corrections can deliver big, reliable wins in real-world humanoid manipulation.

Practical Applications

  • Voice-driven fetching: “Bring me the red notebook from the desk.”
  • Office tidying: Collect cups and bottles from varied tables in lounges and meeting rooms.
  • Retail assistance: Pick and place products on shelves with different shapes and labels.
  • Hospital logistics: Retrieve labeled items (e.g., 'blue sanitizer bottle') in changing environments.
  • Café service: Find and carry mugs, bottles, or snack items from counters of different heights.
  • Education/labs: Assist by gathering tools or components named by students or researchers.
  • Light home assistance: Pick up toys, books, or groceries named by the user.
  • Maintenance tasks: Grasp and hand over specific tools from cluttered benches.
  • Demo and training platform: Upgrade vision or grippers without reworking the whole pipeline.
  • Door interaction starter: Identify and grasp handles to begin opening tasks (with stronger hands next).
#humanoid loco-manipulation#end-effector tracking#open-vocabulary perception#residual neural kinematics#neural odometry#inverse kinematics#motion planning#AnyGrasp#Grounding DINO#SAM#sim-to-real#PPO#goal adjustment#closed-loop replanning#Unitree G1