
EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Intermediate
Yu Bai, MingMing Yu, Chaojie Li et al. Ā· 2/4/2026
arXiv Ā· PDF

Key Summary

  • EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.
  • It unifies four action types—movement, head look-around (active perception), hand manipulation, and talking/gesturing to people—so the robot can smoothly switch between them.
  • The model uses two kinds of language actions: precise, structured phrases for moving and looking, and open-ended natural language for manipulating objects and talking to people.
  • Trained on a large mix of egocentric videos, spatial reasoning Q&A, virtual simulations, and a bit of on-robot experience, EgoActor learns a strong spatial sense from RGB images alone.
  • On a real Unitree G1 humanoid, EgoActor shows better doorway traversal (fewer bumps), reliable person approach and greeting/asking, and solid support for mobile manipulation.
  • Compared with strong navigation baselines, EgoActor is much better at stopping at the right spot for interaction instead of endlessly wandering.
  • Both the 4B and 8B model sizes run in under one second per decision; the 8B is stronger at fine-grained person disambiguation while the 4B stays snappier and close in performance.
  • The 'secret sauce' is expressing low-level motor moves as short, structured text, letting a language model plan, align, and time actions directly from first-person video.
  • Limitations include reliance on external low-level skills, weaker very-long-horizon memory, and missing crouch/stand skills in the real-robot setup.
  • The team is releasing code, models, datasets, and benchmarks to help others build on this approach.

Why This Research Matters

Humanoid robots are stepping into homes and offices, where tight doorways, cluttered desks, and real people make tasks tricky. EgoActor helps robots stop in the right spot and do the next right thing—like greet someone or pick up a cup—by turning words directly into precise, egocentric actions. This reduces collisions, wasted motion, and awkward interactions, making robots feel more helpful and natural. Because it runs on RGB-only cameras and standard hardware, it lowers costs and complexity for real deployments. As the open-source release spreads, we can expect faster progress on assistive robots, office couriers, and collaborative helpers in everyday spaces.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re playing a first-person video game. You only see what’s in front of you. To reach a goal, you must walk, turn your head, press a button, and talk to characters—all at just the right times. That’s exactly what a real humanoid robot faces in the real world.

🄬 Filling (The Actual Concept): What the world looked like before: Robots were good at single skills—like walking steadily or picking things up—when the scene was simple and the task was short. But in real homes and offices, the world is messy, partly hidden, and always changing. Robots need to move, look around, use hands, and talk to people—and switch between those quickly. How it worked before (and what broke):

  • Many systems used predefined libraries of skills. They could plan, ā€œFirst navigate, then pick,ā€ but they didn’t tightly connect what they saw right now with exactly how far to move or where to look.
  • Vision-and-Language Navigation models followed directions to go places but often didn’t stop at the perfect spot for grabbing or talking.
  • Planners could list steps, yet the final few inches of movement, the exact head tilt, and the timing of a grasp still failed under clutter and doorways.

Why it matters: Without tight coordination, robots bump doorframes, stop too far from a desk to reach, or talk to the wrong person wearing the wrong shirt.

šŸž Bottom Bread (Anchor): Think of ā€œGo through the doorway, greet the person in the brown shirt, and pick up the cup.ā€ That’s three different action types, each needing correct timing and precise positioning from the robot’s own camera view.

šŸž Top Bread (Hook): You know how a selfie camera shows only what’s in front of you? A robot’s main clue about the world comes from its own camera, too.

🄬 The Concept: Egocentric observation means the robot sees from its own eyes (first person) and must judge distances, angles, and obstacles from that view only. How it works:

  1. The robot watches a short history of its camera frames.
  2. It remembers the last few actions it took.
  3. It decides the next tiny, precise step: how far to move, how much to turn, where to look.

Why it matters: If the robot can’t decide from a first-person view, it will miss corners, clip obstacles, or fail to line up the hand with the cup.

šŸž Anchor: Standing at a door, the robot must turn slightly left, step forward 0.3 meters, then glance down to clear the threshold before entering.

šŸž Top Bread (Hook): You know how following a recipe means reading words and then doing actions with your hands while looking at the ingredients?

🄬 The Concept: A Vision-Language Model (VLM) reads instructions and looks at images to choose actions. How it works:

  1. It takes in a task sentence (like ā€œApproach the desk and pick up the pink cupā€).
  2. It sees recent and past frames from the robot’s camera.
  3. It predicts short action phrases that tell the robot exactly how to move, look, use hands, or talk.

Why it matters: Without a VLM, words and pictures don’t come together, and the robot won’t know how to turn language into precise motion.

šŸž Anchor: The VLM sees the desk is a bit right of center and says ā€œLeft sidewalk 0.40 meters; Turn left 20 degrees; Move forward 0.5 meters; Pick up the pink cup.ā€

šŸž Top Bread (Hook): Imagine packing your backpack. You turn it around, peek inside, tilt it to see better. That’s active looking, not just staring.

🄬 The Concept: Active perception means the robot moves its head (and sometimes body) to gather better information for the next step. How it works:

  1. Scan: Look up/down/left/right when details are missing.
  2. Focus: Keep the target in view as you get closer.
  3. Verify: Glance down to avoid bumping into things.

Why it matters: If the robot never adjusts its view, it’ll miss obstacles, misread targets, and fail at tight tasks.

šŸž Anchor: While exiting a room, the robot peeks down to spot the door lip, then looks ahead to aim through the center.

šŸž Top Bread (Hook): If you ask a friend for directions, you face them, stand a comfy distance away, and speak clearly.

🄬 The Concept: Human-robot interaction means the robot approaches the right person and speaks or gestures in sensible, polite ways. How it works:

  1. Navigate to the person wearing the specified clothes.
  2. Stop about one meter away, facing them.
  3. Speak the requested sentence (e.g., ā€œCould you show me the reception?ā€).

Why it matters: If the robot stops too far, faces the wrong way, or asks the wrong person, the interaction fails.

šŸž Anchor: ā€œApproach the person in a brown shirt and say hi.ā€ The robot centers on the brown shirt and says, ā€œHi there!ā€

šŸž Top Bread (Hook): When you reach for a cup, you don’t stand across the room; you walk close, line up your hand, and then grab.

🄬 The Concept: Manipulation requires the robot to position itself and orient its head so a hand skill can succeed. How it works:

  1. Approach to the right distance and angle.
  2. Keep the target stable in view.
  3. Trigger the hand/arm controller at the right moment.

Why it matters: If timing or distance is off by just a bit, the grasp fails.

šŸž Anchor: ā€œApproach and pick up the pink cup.ā€ The robot edges closer in small steps, centers the cup, then triggers the grasp.

šŸž Top Bread (Hook): Ever slide sideways through a crowded hallway to avoid bumping into people?

🄬 The Concept: Traversability is about safely moving through tight spaces without collisions. How it works:

  1. Read doorway width and frame positions from the camera.
  2. Combine small turns, forward steps, and side steps.
  3. Adjust view to verify clearance.

Why it matters: Without this skill, robots clip doorframes or hesitate and never pass.

šŸž Anchor: To exit a storage room, the robot sidesteps right 0.4 m, turns 10°, and moves forward 0.3 m, clearing the frame smoothly.

šŸž Top Bread (Hook): When you do a long task, each little move depends on the last one.

🄬 The Concept: Temporal action prediction is choosing the next best small action based on recent frames and recent moves. How it works:

  1. Look at the last few images and actions.
  2. Predict a short sequence like ā€œTurn; Move; Look.ā€
  3. Repeat quickly in a loop.

Why it matters: Without this, decisions feel random and jerky instead of smooth and purposeful.

šŸž Anchor: While rounding a corner, the robot predicts ā€œTurn left 15°; Move forward 0.4 m; Left sidewalk 0.3 m,ā€ then repeats until aligned.

02Core Idea

šŸž Top Bread (Hook): Picture a great orchestra: strings, brass, woodwinds, percussion. If they don’t play together on the beat, the music falls apart. A humanoid robot has its own ā€˜sections’: walking, head-looking, hand-using, and talking. They must stay in sync.

🄬 The Concept: Aha! Use a vision-language model to directly speak low-level, egocentric action phrases that coordinate movement, active perception, manipulation, and human interaction—so the robot can fluently turn words into precise motion. How it works (big picture):

  1. Input: A natural language instruction and the robot’s recent and historical camera frames.
  2. Reason: The model aligns words with what it sees to understand distances, angles, and which action type should come next.
  3. Output: Short, structured action phrases for moving/looking and natural language commands for manipulating/talking.

Why it matters: Without this unification, robots treat steps separately and stumble at the handoffs—stopping too soon for a grasp, missing door frames, or talking to the wrong person.

šŸž Bottom Bread (Anchor): For ā€œEnter the room on your right and say hi,ā€ the model outputs: ā€œTurn right 25°; Move forward 0.6 m; Right sidewalk 0.3 m; Look down 8°; Move forward 0.4 m; Say ā€˜Hi there!ā€™ā€

šŸž Top Bread (Hook): Think of three different kid-friendly analogies:

  • Conductor analogy: The model is the conductor keeping all sections (move, look, use hands, speak) in rhythm.
  • GPS-plus-Head-Turner analogy: It’s like a GPS that also tells you exactly how much to tilt your head to check a mirror.
  • Video Game Micro-Commands: It’s like a game helper that whispers tiny, precise moves: ā€œnudge right; look up a tad; press grab—now.ā€

🄬 The Concept: Why it works: the action language is the glue. How it works:

  1. Structured Language Actions (SLAs): Tiny, interpretable commands like ā€œTurn left 30.5Ā°ā€ and ā€œMove forward 0.26 mā€ give precise control.
  2. Natural Language Actions (NLAs): Open-ended text like ā€œPick up the bottleā€ or ā€œAsk ā€˜Where is the kitchen?ā€™ā€ integrates manipulation and conversation.
  3. Egocentric grounding: Training on first-person videos teaches the model to read space from a single camera view.

Why it matters: SLAs give centimeter- and degree-level positioning; NLAs let the robot be flexible and social; together, they bridge words to motors.

šŸž Bottom Bread (Anchor): In a crowded desk scene, the model inches forward in smaller and smaller steps, looks down a bit to verify, then says ā€œPick up the pink cup,ā€ triggering the hand policy.

šŸž Top Bread (Hook): Imagine fitting a couch through a doorway: you shuffle, rotate, peek, and move. Each micro-move depends on the last.

🄬 The Concept: Before vs. After. Before: Navigation models could follow routes, but they often didn’t stop perfectly to interact; manipulation policies needed the robot already in the right spot. After: EgoActor plans and times both—moving, looking, and then triggering the hand or speech—so the final action succeeds more often. Why it matters: The difference between ā€œcloseā€ and ā€œcorrectā€ is what makes the robot feel helpful instead of clumsy.

šŸž Bottom Bread (Anchor): Baseline agents sometimes keep walking past the doorway or spin too much; EgoActor slides through, stops, faces the person, and speaks.

šŸž Top Bread (Hook): You know how a good coach simplifies a hard sport into simple drills you can do fast?

🄬 The Concept: Building blocks.

  • Inputs: instruction + recent image-action pairs + long visual history + list of available skills.
  • Spatial grounding: trained from egocentric human videos, simulated rooms, and spatial reasoning Q&A, so it learns what 0.4 m looks like.
  • Action vocabulary: SLAs for precise moving/looking; NLAs for hands/talking.
  • Routing & parsing: SLAs become velocities/angles; NLAs get sent to speech, gestures, or a manipulation VLA policy.
  • Real-time loop: Sub-second decisions make movement smooth, not stop-and-go.

Why it matters: If any block is missing, timing, precision, or understanding breaks.

šŸž Bottom Bread (Anchor): With ā€œApproach and grab the apple,ā€ inputs show the apple drifting right in view; the model outputs a small right strafe, a tiny forward step, then ā€œPick up the apple.ā€

03Methodology

High-level recipe: Instruction + Egocentric images + Recent actions → VLM reasoning → Action phrases (SLAs/NLAs) → Parse & Execute on a humanoid.

Step 1. Frame the task as EgoActing

  • What happens: The robot gets a high-level but explicit instruction (e.g., ā€œEnter the room on your right and say hi to the personā€), plus a handful of historical frames and the last 2–3 observation-action pairs.
  • Why it exists: The recent pairs teach the model the motion trend (e.g., we just turned right a bit), and the history gives long-term context.
  • Example: Instruction: ā€œApproach and pick up the orange on the desk.ā€ Recent: image shows the desk closer than before; last action was ā€œMove forward 0.5 m.ā€
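
To make this input framing concrete, here is a minimal Python sketch of how an instruction, a long visual history, and the last few observation-action pairs could be assembled into one model prompt. The field names, image placeholders, and layout are illustrative assumptions, not the paper's exact template.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    frame_path: str   # egocentric camera frame captured at this step
    action: str       # action phrase executed after seeing that frame

def build_prompt(instruction: str, history: List[str], recent: List[Step]) -> str:
    """Assemble the text side of the model input: the task instruction, a long
    visual history (referenced here by image placeholders), and the last few
    observation-action pairs that convey the motion trend."""
    lines = [f"Task: {instruction}"]
    lines.append("History frames: " + ", ".join(f"<img:{p}>" for p in history))
    for i, step in enumerate(recent, 1):
        lines.append(f"Recent obs {i}: <img:{step.frame_path}> -> action: {step.action}")
    lines.append("Predict the next action phrases:")
    return "\n".join(lines)

# The last actions show the robot already closing in on the desk.
print(build_prompt(
    "Approach and pick up the orange on the desk.",
    history=["t-12.jpg", "t-09.jpg", "t-06.jpg"],
    recent=[Step("t-03.jpg", "Move forward 0.5 meters"),
            Step("t-00.jpg", "Turn right 10.0 degrees")],
))
```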

Step 2. Use a general VLM backbone (Qwen3-VL) with LoRA finetuning

  • What happens: Start from a strong vision-language transformer with dynamic image resolution. Apply LoRA to finetune linear layers efficiently.
  • Why it exists: Keeps the architecture simple and scalable; LoRA reduces compute/memory while adapting the model to action prediction.
  • Example: A 4B or 8B model makes each decision in under one second.
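
As a rough illustration of this step, the snippet below attaches LoRA adapters to a VLM backbone's linear layers with Hugging Face `peft`. The checkpoint id, rank/alpha values, and target module names are placeholders, and the exact model class for Qwen3-VL may differ from the generic Auto class used here.

```python
from transformers import AutoModelForVision2Seq   # generic choice; Qwen3-VL may need its own class
from peft import LoraConfig, get_peft_model

# Checkpoint id, rank/alpha, and target module names below are placeholders.
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")  # hypothetical id
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # wraps the listed linear layers with low-rank adapters
model.print_trainable_parameters()        # only the adapters (a small fraction of weights) are updated
```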

Step 3. Multi-scale, multi-frame input formatting

  • What happens: Use 10 lower-res historical frames (240p) and 3 recent higher-res frames (480p) to balance cost and detail.
  • Why it exists: The model needs context (history) and precision (recent frames). Without both, it either forgets where it came from or misjudges small moves.
  • Example: The older frames show a hallway getting wider; the recent frames show the doorway edge in detail.
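
A minimal sketch of this multi-scale formatting, assuming plain Pillow resizing; the real preprocessing (dynamic resolution, tokenization) lives inside the VLM's processor and may differ.

```python
from PIL import Image

def format_frames(history_paths, recent_paths):
    """Downscale long-term context frames and keep the most recent frames sharper,
    mirroring the 10 x 240p history + 3 x 480p recent layout described above."""
    def resize_to_height(path, target_h):
        img = Image.open(path).convert("RGB")
        w, h = img.size
        return img.resize((round(w * target_h / h), target_h), Image.BILINEAR)

    history = [resize_to_height(p, 240) for p in history_paths[-10:]]  # coarse context
    recent = [resize_to_height(p, 480) for p in recent_paths[-3:]]     # fine detail for small moves
    return history + recent
```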

Step 4. Teach two action types: SLAs and NLAs

  • What happens: The model predicts short, structured phrases for moving/looking (SLAs) and open language for manipulation/talking (NLAs).
  • Why it exists: SLAs give exact geometry; NLAs keep interaction/manipulation flexible and expressive.
  • Example: ā€œTurn right 20.0 degrees; Move forward 0.40 meters; Ask ā€˜Could you show me the reception?ā€™ā€
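
To show the two action types side by side, here is a small sketch with hypothetical `SLA`/`NLA` containers whose phrase templates follow the examples in this article; the exact training-time format may differ.

```python
from dataclasses import dataclass

@dataclass
class SLA:                      # structured language action: precise moving/looking
    verb: str                   # e.g. "Turn left", "Move forward", "Look down"
    value: float                # magnitude of the move
    unit: str                   # "degrees" or "meters"

    def phrase(self) -> str:
        return f"{self.verb} {self.value:.2f} {self.unit}"

@dataclass
class NLA:                      # natural language action: manipulation or speech
    text: str                   # e.g. "Pick up the pink cup" or "Say 'Hi there!'"

    def phrase(self) -> str:
        return self.text

plan = [SLA("Turn right", 20.0, "degrees"),
        SLA("Move forward", 0.40, "meters"),
        NLA("Ask 'Could you show me the reception?'")]
print("; ".join(a.phrase() for a in plan))
# Turn right 20.00 degrees; Move forward 0.40 meters; Ask 'Could you show me the reception?'
```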

Step 5. Parse and route actions to the robot

  • What happens: A simple parser extracts numbers from SLAs and turns them into velocity/angle commands. NLAs route by keywords: speech → TTS; gestures → presets; all other NLAs → manipulation VLA.
  • Why it exists: The robot needs machine-ready commands; without parsing, language won’t drive motors.
  • Example: ā€œLook down 8.0 degreesā€ becomes a pitch angle for the 2-DoF head.
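
Below is a minimal parser-and-router sketch in the spirit of this step. The verb list, keyword rules, and executor names are assumptions for illustration; the released parser may use different conventions.

```python
import re

SLA_PATTERN = re.compile(
    r"(Move forward|Move backward|Turn left|Turn right|Left sidewalk|Right sidewalk|"
    r"Look up|Look down|Look left|Look right)\s+([0-9.]+)\s*(meters?|degrees?)",
    re.IGNORECASE,
)

def route(action_phrase):
    """Map one predicted phrase to an executor and its arguments."""
    m = SLA_PATTERN.match(action_phrase.strip())
    if m:                                                   # structured move/look command
        verb, value, unit = m.group(1), float(m.group(2)), m.group(3).lower()
        return "locomotion_or_head", {"verb": verb, "value": value, "unit": unit}
    low = action_phrase.lower()
    if low.startswith(("say", "ask", "answer")):
        return "tts", {"utterance": action_phrase}          # speech synthesis
    if low.startswith(("wave", "nod", "point")):
        return "gesture", {"name": action_phrase}           # preset gestures
    return "manipulation_vla", {"command": action_phrase}   # e.g. "Pick up the apple"

for phrase in ["Look down 8.0 degrees", "Ask 'Where is the kitchen?'", "Pick up the pink cup"]:
    print(phrase, "->", route(phrase))
```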

Step 6. Train with broad, scalable supervision

  • What happens: Mix data from many sources:
    • Egocentric Internet videos (EgoTaskQA + more)
    • Local real videos with changing layouts
    • Virtual rooms (VLN-CE/Habitat) for controlled navigation
    • Spatial reasoning (MindCube)
    • Visual-language understanding (GQA)
    • High-level planning (RoboVQA, EgoPlan, ALFRED)
    • Unsupervised movement pairs (learn motion between two frames)
    • A bit of real on-policy DAgger data
  • Why it exists: Each source teaches a different piece: spatial sense, instruction following, precise movement, and real-world quirks. Without this mix, the model overfits or misses key skills.
  • Example: MindCube sharpens 3D reasoning; VLN-CE teaches room layouts; local videos add hallway/doorway realism.
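
One way to picture the training mix is as a weighted sampling table over the sources named above. The weights below are placeholders invented for this sketch, not values from the paper.

```python
import random

# Illustrative data-mixture table over the sources listed above; the sampling
# weights are placeholders, not values from the paper.
DATA_MIX = {
    "egocentric_web_video": 0.25,      # EgoTaskQA and similar first-person footage
    "local_real_video": 0.10,          # hallway/doorway recordings with changing layouts
    "vlnce_habitat_sim": 0.20,         # controlled navigation in virtual rooms
    "mindcube_spatial_qa": 0.10,       # spatial reasoning Q&A
    "gqa_understanding": 0.10,         # general visual-language grounding
    "high_level_planning": 0.10,       # RoboVQA / EgoPlan / ALFRED style plans
    "unsupervised_motion_pairs": 0.10, # learn motion between two frames
    "on_policy_dagger": 0.05,          # a small amount of real on-robot corrections
}

def sample_source(rng=None) -> str:
    """Pick which data source the next training example comes from."""
    rng = rng or random.Random(0)
    names, weights = zip(*DATA_MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]
```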

Step 7. Extract precise movement labels from videos

  • What happens: Use MASt3R to estimate camera pose and derive small movement deltas roughly every 1.5 s; merge opposing motions and drop tiny ones using thresholds (e.g., changes under 5° or 0.1 m are discarded).
  • Why it exists: Video rarely has ground-truth wheel odometry; this recovers usable SLAs from plain RGB footage.
  • Example: A head cam moving around a table yields ā€œTurn left 12°; Move forward 0.3 m; Left sidewalk 0.2 m.ā€
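
A simplified sketch of turning two estimated camera poses into structured movement phrases with the thresholds mentioned above. The axis conventions, yaw extraction, and merging rule are assumptions; the actual pipeline built on MASt3R may differ.

```python
import numpy as np

def pose_delta_to_slas(T_prev, T_curr, min_deg=5.0, min_m=0.1):
    """Convert two 4x4 camera-to-world poses (e.g., estimated by a tool such as
    MASt3R) into structured movement phrases, dropping motions below the
    thresholds. Axis conventions (x = right, z = forward) are assumptions."""
    rel = np.linalg.inv(T_prev) @ T_curr                  # motion expressed in the previous frame
    dx, dz = rel[0, 3], rel[2, 3]                         # lateral and forward translation
    yaw = np.degrees(np.arctan2(rel[0, 2], rel[2, 2]))    # heading change about the vertical axis

    phrases = []
    if abs(yaw) >= min_deg:
        phrases.append(f"Turn {'right' if yaw > 0 else 'left'} {abs(yaw):.1f} degrees")
    if abs(dz) >= min_m:
        phrases.append(f"Move {'forward' if dz > 0 else 'backward'} {abs(dz):.2f} meters")
    if abs(dx) >= min_m:
        phrases.append(f"{'Right' if dx > 0 else 'Left'} sidewalk {abs(dx):.2f} meters")
    return phrases                                        # tiny or cancelling motions are dropped
```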

Step 8. Balance the action distribution

  • What happens: Oversample turning actions and NLAs so the model doesn’t bias toward long straight walking.
  • Why it exists: Real-life needs many turns and frequent interaction/manipulation; without balance, the model forgets to stop and act.
  • Example: The training batch includes extra samples where the final step is ā€œPick up the bottleā€ or ā€œSay hi.ā€
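
A toy oversampling sketch, assuming each training sample is already tagged with an action category; the category names and boost factors are placeholders, not the paper's ratios.

```python
import random

def oversample(samples, boost=None, rng=None):
    """Add extra copies of under-represented categories so turns and NLAs show up
    more often than long straight walking in each training batch."""
    rng = rng or random.Random(0)
    boost = boost or {"turn": 2, "nla": 3}       # extra copies per category (placeholder values)
    balanced = list(samples)
    for s in samples:
        balanced.extend([s] * boost.get(s["category"], 0))
    rng.shuffle(balanced)
    return balanced

batch = oversample([
    {"category": "forward", "target": "Move forward 0.8 meters"},
    {"category": "turn",    "target": "Turn left 25.0 degrees"},
    {"category": "nla",     "target": "Pick up the bottle"},
])
```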

Step 9. Merge low-level actions into smooth commands (virtual data)

  • What happens: In simulators, combine discrete steps (turn/forward/look) into larger, human-like micro-plans with small random jitters for robustness.
  • Why it exists: Smoothness reduces stop-start jitter. Randomness prevents overfitting to exact angles/distances.
  • Example: Two tiny left turns merge into ā€œTurn left 18°,ā€ followed by ā€œMove forward 0.5 m.ā€
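
The merge-plus-jitter idea can be sketched in a few lines; the merge rule (summing consecutive same-verb steps) and the jitter scale are simplifying assumptions.

```python
import random

def merge_steps(steps, jitter=0.1, rng=None):
    """Fuse consecutive simulator steps of the same type into one larger command,
    then add small multiplicative noise so the model never sees only exact values."""
    rng = rng or random.Random(0)
    merged = []
    for verb, value in steps:                           # e.g. ("Turn left", 9.0)
        if merged and merged[-1][0] == verb:
            merged[-1] = (verb, merged[-1][1] + value)  # combine with the previous step
        else:
            merged.append((verb, value))
    return [(verb, value * (1 + rng.uniform(-jitter, jitter))) for verb, value in merged]

print(merge_steps([("Turn left", 9.0), ("Turn left", 9.0),
                   ("Move forward", 0.25), ("Move forward", 0.25)]))
# roughly [("Turn left", ~18), ("Move forward", ~0.5)] with small random jitter
```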

Step 10. Close the loop on a real humanoid

  • What happens: Execute SLAs using the Unitree G1 locomotion controller (with tuned precision ~5 cm, ~5°). Run sub-second VLM inference repeatedly, updating from new frames.
  • Why it exists: Tight sensing–action–sensing loops make behavior smooth and collision-safe. Without low latency, the robot over- or under-shoots.
  • Example: As the doorway nears, the loop adjusts: a tiny sidestep, a small turn, a short forward.
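
Here is a skeleton of the closed sensing-action loop. `model.predict`, the `robot` interface, and `task_done` are hypothetical stand-ins for the VLM inference call and the Unitree G1 controllers; `route` refers to the parser sketched under Step 5.

```python
def control_loop(model, robot, instruction, max_steps=200):
    """Sense -> decide -> act, repeated with sub-second decisions.
    `model`, `robot`, and `task_done` are hypothetical interfaces; `route` is
    the parser/router sketched under Step 5."""
    history, recent = [], []
    for _ in range(max_steps):
        frame = robot.capture_frame()                        # new egocentric RGB image
        phrases = model.predict(instruction, history, recent, frame)
        for phrase in phrases:
            executor, args = route(phrase)                   # SLA -> velocities/angles, NLA -> TTS/gesture/VLA
            robot.execute(executor, args)
            recent = (recent + [(frame, phrase)])[-3:]       # keep the last few observation-action pairs
        history.append(frame)                                # long-term visual context
        if robot.task_done():                                # e.g. the grasp or greeting has been triggered
            break
```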

Step 11. Trigger hands and talk at the right time

  • What happens: When the view and distance look right, the model emits an NLA (e.g., ā€œPick up the appleā€ or ā€œAsk ā€˜Where is the kitchen?ā€™ā€). The system sends it to the right executor.
  • Why it exists: Timing matters. Too early → grasp fails or speech sounds odd; too late → wasted time or collisions.
  • Example: The model only triggers ā€œPick up the pink cupā€ once the cup is well-centered and close.

Secret Sauce

  • Expressing low-level moves as readable mini-sentences lets the language model plan and align geometry and timing from egocentric video.
  • Mixing real human videos with simulation and Q&A grows spatial intuition without extra sensors.
  • Active perception is a first-class action: the model moves its gaze to reason better, not just to record video.

04Experiments & Results

The Tests

  • Human–Robot Interaction: Approach a specified person (by clothing, posture, etc.) and perform the correct social action (say hi, ask for location, request an item).
  • Mobile Manipulation: Navigate to a desk, approach a target object (seen or unseen category), and trigger the manipulation at the right moment (pick/place).
  • Traversability: Enter and exit real rooms through narrow doorways from different starting sides, avoiding collisions.
  • Virtual EgoActing: In unseen simulated rooms, stop close to the right place and output the correct natural-language action.

The Competition

  • Strong navigation baselines: NaVid, Uni-NaVid, NaVILA (VLM-based navigation systems).
  • Same robot and camera setup where applicable; EgoActor judged on both movement and correct interaction/manipulation triggers.

The Scoreboard (with context)

  • Human–Robot Interaction (real-world):
    • Single-person tasks: EgoActor-4B and 8B succeeded in all tested approach-and-interact tasks (12/12 each type); baselines only checked approach, not social actions.
    • Multi-person disambiguation (out-of-distribution attributes): 8B clearly stronger (e.g., clothing/accessories/posture/direction/gender: 10–12/12) than 4B (7–11/12). This is like the 8B student acing fine-detail identification while the 4B student still does well but misses some trick questions.
  • Mobile Manipulation (real-world):
    • Unseen desk layout; mix of in-distribution (apple/bottle) and out-of-distribution (pen holder/pink cup) targets.
    • EgoActor-8B reached correct spots and triggered grasp/place reliably (5–6/6 in most categories). 4B sometimes fired manipulation slightly too early (2–5/6), like stopping a step short of the table.
  • Traversability (real-world doorways):
    • EgoActor-4B/8B: typically 10–12/12 successes in seen rooms and 7–8/8 in unseen rooms for left/right, entering/leaving.
    • Baselines: far lower success, with frequent doorframe bumps or needless spins before passing. Think ā€œA gradeā€ for EgoActor where baselines hover around ā€œCā€ in tight spaces.
  • Virtual EgoActing (unseen scenes):
    • Stopping distance within 3.0 m: EgoActor-4B ~87–89%, 8B ~89–91%; baselines ~51–60%.
    • Under stricter thresholds (≤1.0 m), EgoActor kept strong leads (ā‰ˆ70%+), while baselines dropped sharply (into the ā‰ˆ20% range).
    • Natural-language action F1: positive (ā‰ˆ0.60–0.62) for EgoActor; near zero or negative for baselines, reflecting much better intent-to-action grounding and stopping behavior.

Surprising/Notable Findings

  • The 4B model trails 8B in fine-grained person disambiguation but stays close on many navigation metrics, keeping speed and responsiveness—with sub-second inference for both.
  • Baselines look fine if you only ask, ā€œDid you eventually get near the goal?ā€ but they often fail the real test: ā€œDid you stop at the right spot and then do the next right thing?ā€
  • EgoActor shows human-like micro-moves (small strafes and combined turn+forward) and active gaze adjustments that boost safe doorway passing and smoother transitions to manipulation.
  • Failure cases often stem from ambiguous virtual instructions or unusual scenes (e.g., churches) that differ markedly from the training data; in the real world, occasional side swipes happen when small nearby obstacles slip out of view during a larger avoidance move.

05Discussion & Limitations

Limitations

  • Modular dependence: EgoActor coordinates decisions but relies on external low-level skills (walking controller, manipulation VLA, TTS) and a base VLM. If any module fails, behavior suffers.
  • Long-horizon memory: With very extended, multi-stage missions, the model can settle into locally reasonable but globally wrong patterns.
  • Real-world action set: Stand/crouch were only used in simulation; the real walking policy didn’t support them in this study.
  • RGB-only sensing: No depth or tactile sensing; extremely cluttered 3D geometry or transparent objects can be tricky.

Required Resources

  • Hardware: A humanoid (Unitree G1 in the paper) with a single RGB camera (e.g., RealSense D455) and a 2-DoF head; dexterous hands (e.g., Unitree Dex3-1) for manipulation tasks.
  • Compute: 4B/8B VLM inference on an onboard or nearby machine; training originally used multiple A100 GPUs.
  • Software: Simple action parser, TTS, gesture presets, and a manipulation VLA policy (e.g., GROOT-N 1.5 finetuned).

When NOT to Use

  • Safety-critical tasks without robust low-level controllers and safety monitors; minor missteps could be costly.
  • Environments requiring depth/tactile precision (e.g., threading a needle, grasping transparent/glossy items) where RGB-only may be unreliable.
  • Ultra-long missions with many sub-goals when persistent memory/planning across hours is needed.

Open Questions

  • Can we integrate locomotion, manipulation, and language into a single end-to-end model that reduces module handoffs?
  • How to add compact, reliable long-term memory (maps, landmarks, people) without slowing real-time decisions?
  • How to incorporate safety layers and uncertainty estimates so the robot knows when to slow down or ask for help?
  • How to adapt quickly to new robots and new homes/offices with minimal extra data?
  • Can we enrich RGB with cheap extra sensing (audio, proprioception) to handle tricky cases (glass, mirrors) while keeping simplicity?

06Conclusion & Future Work

Three-Sentence Summary

EgoActor turns high-level instructions into egocentric, low-level action phrases that coordinate walking, head movements, manipulation triggers, and human interaction—all in real time from only RGB images. By expressing motion as short, structured language and interaction/manipulation as natural language, it bridges planning and precise motor execution, improving doorway traversal, person approach-and-talk, and mobile manipulation. The 4B/8B models generalize well in both real and virtual unseen settings and run with sub-second latency.

Main Achievement

A unified, vision-language-driven way to ground words into spatially aware micro-actions, closing the gap between abstract plans and concrete, timed motor steps on a humanoid robot.

Future Directions

  • Fold external skills into a more end-to-end system with built-in safety and longer memory.
  • Expand the action set (e.g., real-world crouch/stand) and add lightweight sensing to tackle transparent or reflective objects.
  • Broaden datasets and benchmarks for richer social interactions, multi-room missions, and collaborative tasks.

Why Remember This

EgoActor shows that using language itself as the carrier of precise motor intent is a powerful, scalable trick: it makes tight, real-world egocentric control feel natural to a VLM. This lets humanoids pass doorways cleanly, stop in the right spot, and then do the next right thing—pick, place, or politely ask—bringing practical home-and-office robots a step closer.

Practical Applications

  • Office greeter: Guide visitors through doors to the reception and politely ask/answer questions.
  • Facility runner: Fetch a labeled item from a desk, place it on a cart, and return through narrow hallways.
  • Retail restocking: Approach shelves, pick/place lightweight goods, and ask staff for aisle updates.
  • Elderly assistance: Navigate doorways safely, approach a person, and deliver verbal reminders or fetch small items.
  • Event support: Move through crowds carefully, locate a staff member by clothing, and relay messages.
  • Hospital courier: Carry supplies between rooms, stop at the correct bedside, and confirm delivery verbally.
  • Janitorial helper (light tasks): Approach surfaces and trigger wiping or switch toggling where appropriate.
  • Education demo bot: Follow classroom instructions, greet students, and point to objects on command.
  • Warehouse runner (lightweight lanes): Traverse narrow passages, align with bins, and request help when unsure.
  • Front-desk rover: Escort guests to rooms, pass through doors smoothly, and ask for confirmation on arrival.
#EgoActing Ā· #vision-language model Ā· #humanoid robot Ā· #egocentric actions Ā· #active perception Ā· #mobile manipulation Ā· #traversability Ā· #structured language actions Ā· #natural language actions Ā· #spatial reasoning Ā· #Qwen3-VL Ā· #MASt3R pose estimation Ā· #DAgger Ā· #LoRA finetuning Ā· #Unitree G1