
GR-Dexter Technical Report

Intermediate
Ruoshi Wen, Guangzeng Chen, Zhongren Cui et al. | 12/30/2025
arXiv | PDF

Key Summary

  • GR-Dexter is a full package—new robot hands, a smart AI brain, and lots of carefully mixed data—that lets a two-handed robot follow language instructions to do long, tricky tasks.
  • It introduces ByteDexter V2, a compact 21-DoF robotic hand with fingertip touch sensors that can perform human-like grasps and bimanual maneuvers.
  • The team built an easy teleoperation setup using a VR headset and data gloves so people can “puppet” the robot and quickly collect high-quality training data.
  • The model is a 4-billion-parameter Mixture-of-Transformers VLA policy that outputs smooth chunks of future actions for both arms and both hands.
  • Training blends four data sources—teleoperated robot trials, web-scale vision–language data, cross-robot datasets, and egocentric human hand data—so the robot learns both precision and generalization.
  • A fingertip-centric motion retargeting pipeline lets skills learned on other robots and humans transfer to the ByteDexter V2 hands despite different joints and shapes.
  • On a long, multi-step “makeup decluttering” task, GR-Dexter matched in-domain performance and stayed strong when the object layouts were rearranged (0.89 success vs. 0.64 for the baseline).
  • On pick-and-place, GR-Dexter beat baselines on seen objects (0.93) and handled unseen objects and unseen instructions well (0.85 and 0.83).
  • The system shows reliable tool use—like vacuuming with button presses and serving bread with tongs—demonstrating real-world dexterity.
  • Limitations include modest human-video scale and separately controlled hands and arms; future work targets larger pretraining and tighter coordination.

Why This Research Matters

Robots that understand language and use finger-level dexterity can help with everyday chores that are too tedious, repetitive, or difficult for people. Safer, more capable assistants can support elder care, hospital logistics, and small businesses where gentle handling matters. In warehouses and kitchens, bimanual dexterity enables tool use, careful packaging, and fast adaptation to new items or instructions. For education and research, GR-Dexter shows a workable path to mix web knowledge with real-world touch and motion skills. Over time, such systems could reduce costs and improve access to high-quality help in homes and workplaces, while operating more reliably in messy, human-centered spaces.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how you can follow a recipe someone tells you out loud while looking at the ingredients on the counter? You hear instructions, you see the world, and you use your hands to act.

🥬 The Concept (Vision–Language–Action Models):

  • What it is: A Vision–Language–Action (VLA) model is an AI that looks (vision), listens/reads (language), and then moves (action) to complete tasks.
  • How it works: 1) Read an instruction like “put the cup in the box.” 2) Look at camera images to find the cup and box. 3) Plan how to move arms and hands. 4) Execute motions to finish the task. 5) Watch what happens and adjust next moves.
  • Why it matters: Without combining seeing, understanding, and doing, a robot might hear the instruction but not find the cup, or see the cup but not know what to do.

🍞 Bottom Bread (Anchor): When you say “pick up the red apple,” the robot focuses on the words “pick up” and “red apple,” looks for a red apple in the scene, grasps it, and places it where you asked.
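
To make the perceive-understand-act loop concrete, here is a minimal Python sketch, assuming placeholder interfaces such as `policy.predict`, `robot.get_camera_frames`, and `robot.execute` (these names are illustrative, not the paper's API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    images: list                 # multi-view RGB frames from the robot's cameras
    proprioception: np.ndarray   # joint angles of the arms and hands

def vla_control_loop(policy, robot, instruction: str, max_steps: int = 500):
    """Repeatedly look (vision), read the instruction (language), and move (action)."""
    for _ in range(max_steps):
        obs = Observation(images=robot.get_camera_frames(),
                          proprioception=robot.get_joint_state())
        action = policy.predict(instruction, obs)   # plan the next motion
        robot.execute(action)                       # act, then observe again
        if robot.task_done():
            break
```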

🍞 Top Bread (Hook): Imagine trying to tie your shoes with giant salad tongs. You could grab things, but detailed, finger-y work would be very hard.

🥬 The Concept (Dexterous Hands vs. Grippers):

  • What it is: Dexterous robotic hands have many moving joints (degrees of freedom) like human fingers; grippers mostly just open and close.
  • How it works: 1) Multiple finger joints bend and spread. 2) The thumb opposes the fingers. 3) Many joints allow pinching, rolling, and reorienting objects in-hand.
  • Why it matters: Without finger-level dexterity, tasks like button pressing, tool use, or careful placement are clumsy or impossible.

🍞 Bottom Bread (Anchor): Using tongs to put a croissant on a plate is okay, but pressing the tiny power button on a vacuum is far easier with a thumb and fingers.

🍞 Top Bread (Hook): Have you ever tried to find a toy when your hand is blocking your view of it?

🥬 The Concept (Occlusion in Perception):

  • What it is: Occlusion is when fingers or objects block the camera’s view, making it harder for the robot to see what it’s doing.
  • How it works: 1) Hands move close to small items. 2) The camera view gets covered. 3) Important contact details can be hidden.
  • Why it matters: Without handling occlusions, the robot may misjudge grasps or lose track of objects mid-task.

🍞 Bottom Bread (Anchor): When the robot pinches a small screw, its own fingers can hide the screw from the camera; the system must still act correctly.

🍞 Top Bread (Hook): Think of controlling a marionette with dozens of strings. More strings mean more control—but also more chances to tangle!

🥬 The Concept (Degrees of Freedom, DoF):

  • What it is: DoF are the independent ways a robot joint can move.
  • How it works: 1) Each finger joint adds a DoF. 2) More DoF = richer motions. 3) The action space grows rapidly as DoF increase.
  • Why it matters: Without managing high DoF, planning becomes slow and unstable, and the robot’s actions can be jerky or wrong.

🍞 Bottom Bread (Anchor): Two arms plus two dexterous hands can exceed 50 DoF—like steering many “strings” smoothly at once.
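
As a quick sanity check on the numbers used later in this article (two 7-DoF arms and two 21-DoF ByteDexter V2 hands), the total adds up to the 56-DoF figure in the Methodology section:

```python
arm_dof = 7       # each of the two arms (see Required Resources)
hand_dof = 21     # each ByteDexter V2 hand
total_dof = 2 * arm_dof + 2 * hand_dof
print(total_dof)  # 56 -> the "56-DoF bimanual robot" described in the Methodology section
```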

🍞 Top Bread (Hook): When you first learn a piano piece, it’s easiest if a teacher shows you where to place each finger.

🥬 The Concept (Teleoperation):

  • What it is: Teleoperation is when a person directly controls the robot—like puppeteering—to show it how to do tasks.
  • How it works: 1) Wear VR gear and gloves. 2) Move your hands naturally. 3) The robot mirrors those motions. 4) Record the robot’s sensors and actions.
  • Why it matters: Without teleoperation demos, the robot wouldn’t have enough high-quality examples to learn tricky, finger-level skills.

🍞 Bottom Bread (Anchor): A human uses VR gloves to show the robot how to grasp tongs, squeeze, and place a croissant.

Before this work, VLA policies mostly controlled simple grippers on robots. That kept data collection relatively simple and made occlusion less severe, but it ruled out delicate, finger-level work. Researchers tried scaling models or training on more gripper data, but those attempts didn’t give robots finger skills or two-handed coordination. What was missing was a complete “triangle” of better hands (hardware), easier demonstration capture (teleoperation system), and smarter training that mixes many kinds of data (robot, web vision–language, other robots, and humans). The stakes are real: at home, in hospitals, and in small businesses, many tasks need fine finger control—buttoning, sorting, wiping, tool use, and careful placement—that grippers struggle to do reliably. GR-Dexter was created to bridge this gap by upgrading all three corners at once: the hand, the data, and the learning method.

02Core Idea

🍞 Top Bread (Hook): Imagine teaching a kid to cook: you talk them through the recipe, show photos of the steps, let them watch pros, and then let them practice in your kitchen with your tools.

🥬 The Concept (The “Aha!” of GR-Dexter):

  • What it is: The key idea is to pair a compact, human-like robotic hand with a VLA model that is co-trained on a pyramid of diverse data (robot, web vision–language, cross-robot, and human) so two-handed, finger-level skills can generalize.
  • How it works: 1) Build ByteDexter V2, a 21-DoF hand with touch sensors. 2) Collect clean teleoperated demos using VR gloves and a headset. 3) Add lots of web vision–language for broad understanding. 4) Add curated cross-embodiment robot data and human hand data. 5) Train a 4B-parameter Mixture-of-Transformers that outputs smooth action chunks for both arms and hands. 6) Use fingertip-centric retargeting so skills transfer across different hand designs.
  • Why it matters: Without co-training on rich, well-aligned sources, the model either overfits to a single robot or fails on new objects/instructions; without a capable hand, the best policy still can’t perform fine manipulation.

🍞 Bottom Bread (Anchor): After training, the robot can read “pick up the darkest object,” find it, grasp it with the right fingers, and place it precisely—even if that exact item or phrasing wasn’t in its teleop data.

Multiple Analogies:

  • Swiss Army Class: The model learns from robot demos (hands-on labs), web vision–language (textbook + pictures), other robots (exchange students), and humans (guest experts) to pass a very hard exam (real-life manipulation).
  • Sports Team: The hardware is the athlete’s body, the data are the drills and scrimmages, and the Mixture-of-Transformers is the coach crafting plays that work in many stadiums.
  • Translation: Fingertip alignment is like translating poetry by meaning, not word-by-word; it preserves the “contact intent” across very different hands.

Before vs. After:

  • Before: VLA policies were strong at following instructions but weak at finger-level dexterity and cross-embodiment generalization.
  • After: GR-Dexter executes long-horizon, bimanual tasks and robustly grasps unseen objects and follows new instructions.

Why It Works (Intuition):

  • Co-training the vision–language backbone on web tasks teaches broad visual grounding and flexible language understanding.
  • Training the action expert on robot and carefully retargeted cross-embodiment/human trajectories teaches contact geometry and control.
  • Fingertip-centric retargeting preserves the physics of touch—the where and how of contact—so joint differences don’t derail learning.
  • Action chunking plus smoothing turns high-DoF control into stable, coordinated motion rather than twitchy frame-by-frame commands.

Building Blocks (Sandwich explanations for key parts):

🍞 Top Bread (Hook): You know how different kids can all learn the same dance by focusing on the foot placements, even if they have different leg lengths?

🥬 The Concept (Cross-Embodiment Data):

  • What it is: Training data from other robots with different hands and bodies.
  • How it works: 1) Standardize camera views. 2) Keep only high-quality trials. 3) Align fingertip paths to ByteDexter V2. 4) Balance tasks by resampling.
  • Why it matters: Without it, the model doesn’t see enough diverse grasps and fails on new objects.

🍞 Bottom Bread (Anchor): A grasp from a 6-DoF hand gets converted so ByteDexter V2 can reproduce the same contact points.

🍞 Top Bread (Hook): Watching many chefs chop onions teaches you general knife skills, even if their knives and kitchens differ.

🥬 The Concept (Human Trajectories):

  • What it is: Egocentric human videos with tracked hand poses used as demonstrations.
  • How it works: 1) Filter for clear views and smooth motion. 2) Stabilize jitter. 3) Map to the same visual/kinematic format as robot data.
  • Why it matters: Without human data, you miss rare, skillful motions that are hard to teleoperate at scale.

🍞 Bottom Bread (Anchor): The model learns thumb-index pinches from human cooking clips and later applies them to pick small parts.
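
The report says hand-pose jitter is stabilized but does not name a specific filter; a minimal sketch using an exponential moving average (an illustrative choice, with an assumed smoothing factor `alpha`) could look like this:

```python
import numpy as np

def stabilize_keypoints(hand_keypoints: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Smooth per-frame hand keypoints of shape (T, K, 3) with an exponential moving average.

    The EMA is an illustrative stand-in for whatever stabilization the actual pipeline uses.
    """
    smoothed = np.empty_like(hand_keypoints)
    smoothed[0] = hand_keypoints[0]
    for t in range(1, len(hand_keypoints)):
        smoothed[t] = alpha * hand_keypoints[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed
```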

🍞 Top Bread (Hook): Think of a choir where different sections handle melodies, harmonies, and rhythm.

🥬 The Concept (Mixture-of-Transformers VLA):

  • What it is: A 4B-parameter architecture with specialized transformer experts handling language, vision, and action prediction.
  • How it works: 1) A vision–language backbone learns grounding via next-token prediction. 2) An action Diffusion-Transformer learns to output future action chunks via flow-matching. 3) The system conditions on instruction, images, and robot state.
  • Why it matters: Without specialized experts, the model struggles to juggle perception, language, and precise control.

🍞 Bottom Bread (Anchor): Given “put the corn in the box,” the vision–language part finds the corn and box; the action expert outputs smooth hand/arm trajectories to do it.
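
Below is a heavily simplified PyTorch sketch of the idea: a vision–language backbone produces context tokens, and an action expert cross-attends to them plus the robot state to emit an action chunk. It assumes pre-fused language and image tokens; the layer sizes, the 30-step chunk length, and the 56-dimensional state are assumptions for illustration, and only the 88-dimensional action step comes from the report.

```python
import torch
import torch.nn as nn

class SimplifiedVLA(nn.Module):
    """Schematic stand-in for a Mixture-of-Transformers VLA (not the paper's architecture)."""
    def __init__(self, d_model=1024, chunk_len=30, action_dim=88, state_dim=56):
        super().__init__()
        self.vl_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.state_proj = nn.Linear(state_dim, d_model)       # robot state -> one extra token
        self.action_expert = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)
        self.chunk_queries = nn.Parameter(torch.randn(chunk_len, d_model))

    def forward(self, vl_tokens, robot_state):
        # vl_tokens: (B, N, d_model) fused instruction + image tokens; robot_state: (B, state_dim)
        ctx = self.vl_backbone(vl_tokens)
        ctx = torch.cat([ctx, self.state_proj(robot_state).unsqueeze(1)], dim=1)
        queries = self.chunk_queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        hidden = self.action_expert(queries, ctx)             # action expert attends to context
        return self.action_head(hidden)                       # (B, chunk_len, action_dim)
```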

03Methodology

At a high level: Input (language + multi-view images + robot state) → Perception & grounding (vision–language backbone) → Action generation (Diffusion-Transformer outputs k-step action chunks) → Trajectory smoothing & execution on the 56-DoF bimanual robot.

Step-by-step (with Sandwich explanations for the key components):

  1. Hardware platform: ByteDexter V2 on a dual-arm system

🍞 Top Bread (Hook): Imagine giving a skilled pianist a well-tuned, compact keyboard with extra-responsive keys.

🥬 The Concept (ByteDexter V2 Hand):

  • What it is: A 21-DoF, linkage-driven robotic hand with 5-DoF thumb and tactile fingertips, designed to be compact and robust.
  • How it works: 1) Four fingers (each 4 DoF) and a 5-DoF thumb enable opposition. 2) Some joints are underactuated with biomimetic linkages (DIP coupled to PIP). 3) Dense tactile arrays at fingertips measure contact force and location. 4) Motors are integrated in the palm for a small form factor.
  • Why it matters: Without enough DoF, opposition, and touch, in-hand manipulation and reliable grasps are fragile.

🍞 Bottom Bread (Anchor): The hand can perform 33 human-like grasp types and press a tiny vacuum power button while holding the tool.
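
To picture the underactuation, here is a toy model of one finger in which the DIP joint follows the PIP joint through a linkage; the coupling ratio is an assumed value for illustration, not a ByteDexter V2 specification.

```python
def finger_joint_angles(mcp_flexion, mcp_abduction, pip, coupling_ratio=0.7):
    """Toy underactuated finger: the DIP angle is driven by the PIP angle via a linkage,
    so fewer actuators are needed than there are joints. `coupling_ratio` is an assumption."""
    dip = coupling_ratio * pip
    return {"MCP_flexion": mcp_flexion, "MCP_abduction": mcp_abduction, "PIP": pip, "DIP": dip}
```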

  2. Sensing and camera setup

🍞 Top Bread (Hook): When you film a play from multiple seats, you miss less action.

🥬 The Concept (Multi-view RGB-D Sensing):

  • What it is: One egocentric and three third-person RGB-D cameras capture the scene and reduce occlusions.
  • How it works: 1) Sync and calibrate cameras. 2) Standardize image sizes and crops. 3) Feed frames to the model for robust perception.
  • Why it matters: Without multiple views, fingers can hide objects, causing perception failures.

🍞 Bottom Bread (Anchor): As the thumb blocks the main camera, a side camera still sees the cup handle during grasp.
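
A generic way to line up frames from the one egocentric and three third-person cameras is nearest-timestamp matching; this sketch shows the common pattern and is not the paper's capture code.

```python
import numpy as np

def sync_frames(camera_streams: dict, ref_timestamps: np.ndarray) -> list:
    """For each reference time, pick the temporally closest frame from every camera.

    `camera_streams` maps camera name -> (timestamps, frames)."""
    synced = []
    for t in ref_timestamps:
        frame_set = {}
        for name, (stamps, frames) in camera_streams.items():
            idx = int(np.argmin(np.abs(stamps - t)))   # nearest-timestamp match
            frame_set[name] = frames[idx]
        synced.append(frame_set)
    return synced
```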

  3. Teleoperation data collection

🍞 Top Bread (Hook): Playing a VR game where your hands in the game match your real hands exactly feels natural.

🥬 The Concept (VR Teleoperation Interface):

  • What it is: Meta Quest headset + Manus gloves + mounted controllers + foot pedals control two arms and two hands in real time.
  • How it works: 1) Track wrists and finger poses. 2) Retarget to robot joints via a whole-body controller. 3) Use constrained optimization (Sequential Quadratic Programming) with collision avoidance and regularization. 4) Record synchronized observations and actions.
  • Why it matters: Without accurate retargeting and safety, demos are slow, unsafe, or low-quality.

🍞 Bottom Bread (Anchor): An operator knits yarn with both hands after minimal practice, producing rich training traces.
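
The report describes retargeting as constrained optimization (Sequential Quadratic Programming) with regularization and collision avoidance. The sketch below uses SciPy's SLSQP solver on a toy objective; `forward_kinematics` is a placeholder for the robot model, and collision-avoidance constraints are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def retarget_step(target_fingertips, forward_kinematics, q_prev, joint_limits):
    """Find joint angles whose fingertips track the operator's fingertips while staying
    near the previous solution (regularization) and inside joint limits."""
    def objective(q):
        tip_err = np.sum((forward_kinematics(q) - target_fingertips) ** 2)
        smooth = 1e-2 * np.sum((q - q_prev) ** 2)       # stay close to the last pose
        return tip_err + smooth

    bounds = list(zip(joint_limits[:, 0], joint_limits[:, 1]))
    result = minimize(objective, x0=q_prev, method="SLSQP", bounds=bounds)
    return result.x
```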

  4. Data pyramid and co-training

🍞 Top Bread (Hook): To learn a language well, you read books, listen to speakers, copy handwriting, and practice conversations.

🥬 The Concept (Four Data Sources):

  • What it is: A mix of 1) teleoperated robot trajectories, 2) web-scale vision–language data, 3) cross-embodiment robot data, and 4) human trajectories.
  • How it works: 1) Vision–language trains the backbone via next-token prediction. 2) Robot/cross-embodiment/human trajectories train the action DiT via flow-matching. 3) Mini-batches mix sources, summing losses.
  • Why it matters: Without this blend, models overfit or lack semantic breadth and dexterous priors.

🍞 Bottom Bread (Anchor): The model learns “darkest object” from web data and the precise pinch from robot/human demos to pick a dark marker.
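
One way to realize “mini-batches mix sources” is weighted sampling across the four corpora. The weights and loader names below are assumptions for the sketch; the report does not publish exact mixing ratios here.

```python
import random

# Assumed mixing weights for illustration only.
SOURCE_WEIGHTS = {
    "robot_teleop": 0.4,
    "web_vision_language": 0.3,
    "cross_embodiment": 0.2,
    "human_trajectories": 0.1,
}

def sample_mixed_minibatch(loaders, batch_size=32):
    """Draw each example's source according to SOURCE_WEIGHTS, so every mini-batch mixes
    semantic supervision (web data) with dexterous supervision (trajectory data)."""
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights)[0]
        batch.append((source, next(loaders[source])))
    return batch
```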

  5. Cross-embodiment and human retargeting

🍞 Top Bread (Hook): Translating dance moves by keeping footfalls and timing the same works across dancers of different heights.

🥬 The Concept (Fingertip-Centric Motion Retargeting):

  • What it is: A preprocessing pipeline that aligns contact geometry (fingertips) across different hands and data sources.
  • How it works: 1) Normalize camera views and object scales. 2) Curate high-quality trajectories. 3) Align fingertip positions/paths to ByteDexter V2. 4) Resample by task for balance. 5) For human data, filter on visibility/velocity and stabilize jitter.
  • Why it matters: Without contact-preserving alignment, joint differences break transfer and degrade control.

🍞 Bottom Bread (Anchor): A two-finger pinch from a different robot becomes a matching thumb–index pinch on ByteDexter V2 with the same contact points.
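
A minimal sketch of the contact-preserving idea: express source fingertip paths relative to the wrist, rescale them to the target hand, and feed the result to a retargeting solver such as the SQP step sketched earlier. The hand-span scaling is an illustrative simplification of the actual pipeline.

```python
import numpy as np

def align_fingertip_paths(src_tips, src_wrist, src_hand_span, tgt_hand_span):
    """Return wrist-relative fingertip targets rescaled for the target hand.

    src_tips: (T, 5, 3) fingertip positions; src_wrist: (T, 3) wrist positions.
    The scaling by hand-span ratio is an assumption made for this sketch."""
    rel = src_tips - src_wrist[:, None, :]      # keep the contact geometry, drop global pose
    scale = tgt_hand_span / src_hand_span
    return rel * scale                          # targets for ByteDexter V2 fingertips
```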

  6. Action representation and generation

🍞 Top Bread (Hook): Planning a short musical phrase is easier than deciding every single note in real time.

🥬 The Concept (Chunked Action Generation with Smoothing):

  • What it is: The model predicts future action chunks over k steps; each step is an 88D vector including arm joints, end-effector poses, hand joints, and fingertip targets.
  • How it works: 1) Condition on language, images, and robot state. 2) The DiT trained with flow-matching proposes a sequence. 3) A trajectory optimizer smooths motions and ensures safe, continuous execution.
  • Why it matters: Without chunking and smoothing, high-DoF control becomes jittery and unstable, hurting delicate grasps.

🍞 Bottom Bread (Anchor): While placing a croissant with tongs, the hands move fluidly without sudden jerks thanks to smoothed, chunked actions.
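
Chunk boundaries are one place where jerky motion can appear. A simple stand-in for the trajectory smoothing described above is to cross-fade consecutive chunks, as in this sketch (the overlap length is an assumed parameter):

```python
import numpy as np

def blend_chunks(prev_chunk_tail, new_chunk, overlap=5):
    """Cross-fade the last `overlap` steps of the previous chunk into the start of the
    new one so the executed trajectory stays continuous. Inputs are (T, action_dim) arrays."""
    blended = new_chunk.copy()
    for i in range(min(overlap, len(prev_chunk_tail), len(new_chunk))):
        w = (i + 1) / (overlap + 1)             # weight shifts toward the new chunk
        blended[i] = (1 - w) * prev_chunk_tail[i] + w * new_chunk[i]
    return blended
```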

  7. Training objectives

🍞 Top Bread (Hook): In school, you learn to recognize words (reading) and also to write essays (generating), both improving your language skill.

🥬 The Concept (Dual Objectives: Next-Token + Flow-Matching):

  • What it is: The backbone learns vision–language via next-token prediction; the action expert learns control via flow-matching.
  • How it works: 1) Alternate/mix mini-batches from web V-L and trajectory data. 2) Compute both losses and sum them. 3) Update shared components jointly.
  • Why it matters: Without joint training, the model might understand language but fumble motions, or move well without understanding instructions.

🍞 Bottom Bread (Anchor): After co-training, the robot can interpret “pick up the drinkable object” and execute a stable grasp on the cup it identifies.
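
In code, the two objectives might look like the sketch below: standard next-token cross-entropy for the backbone and a rectified-flow style velocity regression for the action expert. The `action_expert(noisy, t, context)` signature is a placeholder, not the paper's interface.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, context, actions):
    """Regress the velocity that transports noise toward the demonstrated action chunk."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
    noisy = (1 - t) * noise + t * actions          # linear path from noise to data
    target_velocity = actions - noise              # time derivative of that path
    pred_velocity = action_expert(noisy, t, context)
    return F.mse_loss(pred_velocity, target_velocity)

def next_token_loss(logits, target_ids):
    """Next-token prediction loss for the vision-language backbone."""
    return F.cross_entropy(logits[:, :-1].transpose(1, 2), target_ids[:, 1:])

# During co-training, both streams contribute to one update: loss = L_next_token + L_flow
```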

The Secret Sauce:

  • Contact-first retargeting preserves what matters physically (where fingers touch), making cross-robot and human data truly useful.
  • Action chunking plus trajectory smoothing stabilizes the massive 56-DoF control problem.
  • Multi-source co-training marries semantic breadth (web data) with dexterous precision (robot/human demos).

04Experiments & Results

The Test: The team evaluated two categories—(1) long-horizon bimanual dexterous manipulation and (2) generalizable pick-and-place—focusing on success rate: does the robot finish what the instruction asks?

The Competition: Baselines included a “plain VLA” trained only on robot teleop data and an ablation “GR-Dexter w/o cross-embodiment data.” GR-Dexter used the full mixed data pyramid.

  1. Long-Horizon Dexterous Manipulation (Makeup Decluttering)

🍞 Top Bread (Hook): Think of cleaning a messy desk by following a checklist—open a drawer, sort items, place each where it belongs—step after step.

🥬 The Concept (Sequential, Multi-Step Evaluation):

  • What it is: The robot performs a sequence of natural-language subtasks (one per item), each starting from a home pose.
  • How it works: 1) Train with 20 hours of teleop data plus co-training with vision–language. 2) Test in two settings: Basic (seen layouts) and OOD (unseen layouts). 3) Measure average success across trials.
  • Why it matters: Without robust long-horizon control, small mistakes in early steps compound and derail the task.

🍞 Bottom Bread (Anchor): For decluttering, GR-Dexter consistently opens drawers, grasps various makeup items, and places them correctly even when their positions change.

Results:

  • Basic (seen layouts): Plain VLA = 0.96, GR-Dexter = 0.97. Translation: Both get an A; GR-Dexter stays strong while benefiting from co-training.
  • OOD (unseen layouts): Plain VLA drops to 0.64 (a D), while GR-Dexter achieves 0.89 (a solid B+ to A-). Co-training with vision–language clearly boosts layout generalization.

Qualitative Tool-Use Tasks:

  • Vacuuming: Stable four-finger grasp, thumb presses power, sweeps confetti; reliable over time.
  • Bread serving with tongs: One hand holds plate, the other operates tongs to move a croissant precisely.
  2. Generalizable Pick-and-Place

🍞 Top Bread (Hook): Picture a teacher asking, “Please put the carrot in the box,” sometimes swapping in new items or using trickier phrases like “pick up the fruit.”

🥬 The Concept (Seen vs. Unseen Objects and Instructions):

  • What it is: Train with ~20 hours and 20 objects; test on seen objects (Basic), 23 unseen objects, and unseen instruction phrasings.
  • How it works: 1) Fix object layouts per batch across policies for fairness. 2) Success = pick target object and place in container.
  • Why it matters: Without generalization, the robot crumbles when objects or wording change.

🍞 Bottom Bread (Anchor): On “pick up the darkest object,” GR-Dexter finds and grasps a black marker it never trained on.

Results:

  • Basic (seen objects): Plain VLA = 0.87, GR-Dexter w/o cross-embodiment = 0.85, GR-Dexter = 0.93 (best). Even in-domain, the full data mix helps a bit.
  • Unseen objects: GR-Dexter reaches 0.85; baselines drop more, showing that cross-embodiment data add real robustness.
  • Unseen instructions: GR-Dexter hits 0.83, reflecting stronger language grounding from web-scale V-L co-training.

Surprising/Notable Findings:

  • Co-training sometimes makes optimization harder in purely in-distribution settings (slight dip for the ablation), but adding cross-embodiment data more than compensates, improving both robustness and peak performance.
  • Fingertip-centric retargeting appears to be a key enabler: grasp quality on unfamiliar objects improves markedly compared with models lacking cross-embodiment alignment.

Scoreboard Summary with Context:

  • Long-horizon OOD jump from 0.64 to 0.89 is like going from struggling to keep a routine on a new stage to almost nailing it.
  • Seen pick-and-place improvement to 0.93 is like raising a B+ to a solid A.
  • On unseen objects/instructions, staying in the low-to-mid 0.8s shows real-world-ready resilience rather than brittle overfitting.

05Discussion & Limitations

Limitations:

  • Human data scale: Only a few hundred hours of egocentric hand videos were used; much larger human datasets exist and could further help rare, delicate skills.
  • Hand–arm control split: Hands and arms are controlled by separate modules, which can hamper the tight, contact-rich coordination needed for tasks like in-hand reorientation under load.
  • Retargeting assumptions: Fingertip-centric alignment works well, but extreme shape/joint differences or strong palm contacts may need richer contact models and tactile priors.
  • Hardware cost/complexity: Dual arms with dexterous hands, multi-view cameras, and VR teleoperation are non-trivial to deploy broadly.

Required Resources:

  • Hardware: Two 7-DoF arms, two ByteDexter V2 hands with tactile sensing, multi-view RGB-D, VR headset, data gloves.
  • Compute: Training a 4B-parameter Mixture-of-Transformers with mixed objectives and large datasets.
  • Data curation: Careful filtering, standardization, and motion retargeting across embodiments and humans.

When NOT to Use:

  • Ultra-high-speed assembly where millisecond-level latency and micrometer precision are mandatory; specialized controllers/sensors may be better.
  • Environments without reliable sensing (e.g., poor lighting, heavy occlusion without multi-view), or where VR teleoperation is infeasible to bootstrap demos.
  • Tasks dominated by non-fingertip contacts (e.g., whole-palm forceful manipulation) unless the system is extended with broader tactile coverage and models.

Open Questions:

  • How far can fingertip-centric retargeting go? What’s the best way to incorporate palm/tendon/tactile fields for richer transfer?
  • Can unified hand–arm control with shared dynamics models and contact prediction improve in-hand reorientation and tool-use finesse?
  • What is the optimal curriculum for mixing web V-L, robot, and human data to speed learning without hurting in-domain optimization?
  • How can we cheaply scale teleoperation quality and diversity, perhaps via shared autonomy or simulation-to-real with tactile priors?

06Conclusion & Future Work

Three-Sentence Summary:

  • GR-Dexter is a full-stack system—new compact dexterous hands, a VR teleoperation pipeline, and a 4B-parameter Mixture-of-Transformers—trained on a pyramid of robot, web vision–language, cross-embodiment, and human data.
  • A contact-preserving retargeting pipeline lets skills transfer across different robots and from humans, while chunked, smoothed actions enable stable control of 56 DoF in bimanual tasks.
  • The system matches strong in-domain performance and substantially improves generalization to unseen layouts, objects, and instructions, succeeding in long-horizon tool use and pick-and-place.

Main Achievement:

  • Showing that co-training a VLA policy with carefully aligned cross-embodiment and human trajectories—on top of solid teleop robot data and a capable 21-DoF hand—produces robust, real-world dexterous bimanual manipulation.

Future Directions:

  • Scale egocentric human data and cross-embodiment corpora, extend tactile sensing and contact modeling beyond fingertips, and unify hand–arm control for tighter dexterity.
  • Explore embodiment-agnostic control abstractions and curricula that balance in-domain optimization with out-of-distribution robustness.

Why Remember This:

  • GR-Dexter demonstrates a practical recipe—better hands, better demos, better data mixing—that turns generalist language-following robots into finger-skilled doers, raising the ceiling on what home and workplace robots can handle.

Practical Applications

  • Home assistance: Decluttering, sorting, and placing household items based on spoken instructions.
  • Kitchen help: Using tongs, opening jars/drawers, and carefully plating food.
  • Healthcare support: Fetching supplies, opening packages, and pressing device buttons gently.
  • Retail and warehousing: Picking varied items and packing them safely despite changing layouts.
  • Light manufacturing: Assembling or inserting small parts that require precise finger placement.
  • Cleaning tasks: Operating small vacuum tools and wiping surfaces with careful pressure control.
  • Education and training: A platform for teaching and studying dexterous manipulation and human–robot interaction.
  • Teleoperation services: Remote experts can demonstrate niche tasks quickly via VR to update the robot’s skills.
  • Laboratory assistance: Handling vials, caps, and instruments that need delicate manipulation.
  • Event setup: Arranging objects and props while following high-level verbal plans.
#vision-language-action · #dexterous manipulation · #bimanual robotics · #Mixture-of-Transformers · #diffusion transformer · #flow matching · #motion retargeting · #cross-embodiment data · #teleoperation · #tactile sensing · #underactuation · #multi-view RGB-D · #action chunking · #trajectory smoothing · #generalization