
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Intermediate
Enshen Zhou, Cheng Chi, Yibo Li et al. · 12/15/2025
arXiv · PDF

Key Summary

  ‱ RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
  ‱ It focuses on two hard skills: finding the right things in 3D (referring) and measuring real-world distances and sizes (measuring).
  ‱ A universal spatial encoder lets the model plug in extra geometry (like depth or camera info) when available for better accuracy.
  ‱ A scale decoder learns the scene’s real size (meters, centimeters) using a special regression loss, so the robot knows what “5 cm” actually looks like in that photo.
  ‱ RoboTracer is trained first with supervised fine-tuning (SFT) and then with reinforcement fine-tuning (RFT) that uses metric-sensitive process rewards to guide multi-step reasoning.
  ‱ The new TraceSpatial dataset (4.5M samples, 30M QAs) teaches 3D referring, measuring, and tracing with up to 9-step reasoning.
  ‱ On tough tests, RoboTracer beats strong baselines: it reaches an average success rate of 79.1% on spatial tasks and outperforms Gemini-2.5-Pro by 36% on TraceSpatial-Bench.
  ‱ The model’s decoupled (u,v,d) point format makes it easy to handle both 2D and 3D tasks and to co-train with existing 2D datasets.
  ‱ In real and simulated robots (UR5, G1 humanoid), RoboTracer plans collision-free, multi-step motions in cluttered, changing scenes.
  ‱ This work narrows the gap between reading instructions and safely doing them in the physical world.

Why This Research Matters

Robots need to follow instructions safely in our messy, 3D world, not just answer questions about pictures. RoboTracer shows how to connect words to real, metric-aware motion plans that avoid collisions and obey distances like centimeters and meters. This unlocks reliable help in homes (tidying, setting tables), workplaces (packing, sorting), and hospitals (fetching items), even in crowded spaces. By using geometry when it’s available and learning the scene’s scale directly, the model is more accurate and trustworthy. The process rewards make the reasoning steps dependable, so the robot doesn’t just guess the final spot. Overall, this brings us closer to robots that don’t just “see,” but actually do the right thing, step by step, in the real world.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re telling a friend, “Water the flowers from left to right, and keep the watering can just a few centimeters above each one.” Your friend must (1) figure out which pots count as “left to right,” (2) know what “a few centimeters” really is, and (3) move smoothly without bumping into things.

đŸ„Ź The Concept (Spatial Tracing, the big need): Spatial tracing is turning a language instruction into a safe, ordered list of 3D waypoints a robot can follow. How it works (in spirit):

  1. Read the instruction.
  2. Find the right objects in the scene and their order.
  3. Measure real-world distances and heights.
  4. Produce a 3D point-by-point path that follows the rule and avoids collisions. Why it matters: Without spatial tracing, robots might try to follow words directly and hit things or stop in the wrong place. 🍞 Anchor: To “place the blue mug to the right of the red bowl by 10 cm,” the robot must pick the right mug and bowl, know where “right” is, understand what 10 cm looks like in that camera view, and then move there safely.
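To make the idea of a spatial trace concrete, here is a minimal sketch (not the paper’s implementation) that represents a trace as an ordered list of metric 3D waypoints and runs a toy clearance check against known obstacle centers; the names (Waypoint, clearance_ok, min_clearance) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple
import math

@dataclass
class Waypoint:
    x: float  # meters, in some camera or world frame (illustrative)
    y: float
    z: float

def clearance_ok(trace: List[Waypoint],
                 obstacles: List[Tuple[float, float, float]],
                 min_clearance: float = 0.03) -> bool:
    """Return True if every waypoint stays at least `min_clearance`
    meters away from every known obstacle center (a toy safety check)."""
    for wp in trace:
        for ox, oy, oz in obstacles:
            if math.dist((wp.x, wp.y, wp.z), (ox, oy, oz)) < min_clearance:
                return False
    return True

# A toy 3-waypoint trace hovering above two flower pots.
trace = [Waypoint(0.10, 0.00, 0.45), Waypoint(0.25, 0.00, 0.45), Waypoint(0.40, 0.00, 0.45)]
print(clearance_ok(trace, obstacles=[(0.25, 0.00, 0.40)]))  # True: 5 cm of clearance
```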

🍞 Hook: You know how saying “the second cookie from the left” only makes sense if you can actually see the line of cookies and count them correctly?

đŸ„Ź The Concept (3D Spatial Referring): 3D spatial referring means pointing to the correct object in real, three-dimensional space by understanding relationships like left/right/front/back/top/bottom. How it works:

  1. Look at all visible objects.
  2. Understand their 3D arrangement and order.
  3. Select the object that matches the words (e.g., “the second flower from the left”). Why it matters: Without correct referring, the robot may grab the wrong flower or mix up ordering. 🍞 Anchor: If asked to “grab the third book from the top shelf,” the robot must know which books are on the top shelf and then pick book number three.
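A hedged sketch of the referring step under simplified assumptions: if detections already come with 3D centers, “the second flower from the left” reduces to sorting by the lateral coordinate and indexing. The detection format below is hypothetical; the real model reasons over visual features rather than a clean object list.

```python
from typing import Dict, List

def refer_nth_from_left(objects: List[Dict], category: str, n: int) -> Dict:
    """Pick the n-th object of `category` counting from the left.
    Assumes each detection carries a 3D center [x, y, z] with x increasing rightward."""
    candidates = [o for o in objects if o["category"] == category]
    candidates.sort(key=lambda o: o["center"][0])  # left-to-right order
    return candidates[n - 1]  # 1-indexed, as in "second from the left"

scene = [
    {"category": "flower", "center": [0.60, 0.1, 0.8]},
    {"category": "flower", "center": [0.20, 0.1, 0.8]},
    {"category": "cup",    "center": [0.40, 0.1, 0.8]},
    {"category": "flower", "center": [0.45, 0.1, 0.8]},
]
print(refer_nth_from_left(scene, "flower", 2))  # the flower at x = 0.45
```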

🍞 Hook: Think about using a ruler in a photo. If you don’t know how big the photo is compared to real life, your measurement won’t be accurate.

đŸ„Ź The Concept (3D Spatial Measuring): 3D spatial measuring is estimating real distances, depths, sizes, and heights in the physical world from camera views. How it works:

  1. Learn the scene’s scale (how pixels map to meters).
  2. Estimate depths and sizes of objects.
  3. Convert the plan into real-world units (cm, m). Why it matters: Without measuring, “move 5 cm” could turn into “move way too far” or “hardly move at all.” 🍞 Anchor: For “hover the watering can 1–5 cm above each flower,” the robot must measure each flower’s height and then add a small, real-world offset.
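A minimal sketch of the measuring idea, assuming a standard pinhole camera model: with intrinsics (fx, fy, cx, cy) and an absolute depth d in meters, a pixel (u, v) can be lifted to a 3D point, and real distances follow from plain Euclidean geometry. The intrinsics and pixel values here are made up for illustration.

```python
import math

def unproject(u: float, v: float, d: float, fx: float, fy: float, cx: float, cy: float):
    """Lift pixel (u, v) with metric depth d (meters) to a 3D camera-frame point."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return (x, y, d)

# Hypothetical intrinsics and two pixels on the top and base of a flower.
fx = fy = 600.0
cx, cy = 320.0, 240.0
top = unproject(300, 180, 0.95, fx, fy, cx, cy)
base = unproject(300, 300, 0.97, fx, fy, cx, cy)
print(f"estimated flower height: {math.dist(top, base):.3f} m")
```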

The world before: Many vision-language models could answer 2D questions (like “Is the cat left of the dog?”) or draw a 2D trace on an image. But they often:

  • Ignored multi-step reasoning (finding several objects in sequence).
  • Produced only 2D outputs (pixels), not true 3D points.
  • Lacked real, absolute scale (so “10 cm” had no reliable meaning).

The problem: Robots need a multi-step, metric-grounded plan: find the right objects in 3D (referring), measure distances and heights (measuring), then make a safe 3D path (spatial trace). Existing methods rarely did all three together.

Failed attempts: Prior methods either stayed in 2D (so no absolute depth), tried to guess 3D without solid scale cues, or skipped supervising the key steps (which object? what distance?). They often made floating or colliding paths.

The gap: What was missing was a model that (1) can plug in extra geometry when available (like depth and camera intrinsics), (2) explicitly learns real-world scale, and (3) is rewarded not just for final answers but also for correct intermediate steps (like “you found the right flower” and “you measured the right height”).

Real stakes: This matters for home helpers, warehouse bots, and humanoids. If a model understands “place the phone 20 cm to the right of the charger” or “stack the box on that shelf without hitting the vase,” robots can actually do useful, safe tasks at home, in hospitals, and in factories.

02Core Idea

🍞 Hook: You know how baking is easier if your recipe has exact amounts (cups, grams) and clear steps like “first preheat, then mix, then bake”? A robot needs the same clarity—what to find, how far to move, and in what order.

đŸ„Ź The Concept (RoboTracer, the main idea): RoboTracer is a 3D-aware vision-language model that turns instructions into accurate, metric-grounded 3D paths by supervising both the key steps (referring and measuring) and the final trace. How it works:

  1. Use a universal spatial encoder to read images and optionally extra geometry (depth, intrinsics) for precise 3D understanding.
  2. Use a scale decoder to predict the scene’s real metric scale.
  3. Train with supervised fine-tuning (SFT) on a big dataset (TraceSpatial) to learn 3D referring and measuring.
  4. Improve with reinforcement fine-tuning (RFT) using metric-sensitive process rewards so the model gets feedback on each step, not just the ending. Why it matters: Without these parts, robots miss steps, mix up objects, or misjudge distances—leading to wrong or unsafe paths. 🍞 Anchor: For “pick the rightmost orange, then place it 0.25 m to the right of the blue bowl,” RoboTracer selects the right orange, converts 0.25 m into the scene’s scale, and generates a collision-free 3D route to get there.

Three analogies for the same idea:

  1. Recipe analogy: The universal spatial encoder is like knowing how to use any oven (extra geometry) if it’s available; the scale decoder is your measuring cup; RFT process rewards are the taste tests at each step.
  2. Map-and-compass: The encoder is your map that accepts landmarks (depth, intrinsics), the scale decoder is your distance scale bar, and process rewards are checkpoint stamps proving you didn’t get lost.
  3. Orchestra: The encoder brings in instruments (geometry) when present, the scale decoder keeps everyone on tempo (real units), and process rewards are the conductor’s mid-performance notes.

🍞 Hook: Imagine you’re labeling points on a photo: where to start, where to go next, and how deep into the scene to move.

đŸ„Ź The Concept ((u,v,d) point format): Represent each waypoint as image x-position (u), y-position (v), and absolute depth (d in meters). How it works:

  1. Use (u,v) to locate the pixel.
  2. Add d to anchor it to the real-world depth.
  3. Convert to 3D using camera intrinsics if available. Why it matters: It simplifies learning, works with 2D-only data by dropping d, and cleanly upgrades to full 3D when geometry is present. 🍞 Anchor: A 5-point trace might be [(176, 788, 0.945), 
], which the robot converts into a 3D flight path above flowers.
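A small sketch of how the decoupled (u,v,d) format can unify 2D and 3D traces: keep the triple for metric tasks, drop d for pixel-only tasks. The extra waypoints and the parsing format are assumptions for illustration, not the model’s actual output grammar.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, Optional[float]]  # (u, v, d); d may be None for 2D-only data

def to_2d(trace: List[Point]) -> List[Tuple[float, float]]:
    """Drop the depth channel to evaluate or co-train against 2D-only annotations."""
    return [(u, v) for u, v, _ in trace]

def is_metric(trace: List[Point]) -> bool:
    """A trace is metrically grounded only if every waypoint carries absolute depth."""
    return all(d is not None for _, _, d in trace)

trace: List[Point] = [(176, 788, 0.945), (210, 760, 0.940), (255, 741, 0.938)]
print(to_2d(trace))      # [(176, 788), (210, 760), (255, 741)]
print(is_metric(trace))  # True
```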

Before vs. after:

  • Before: Models guessed in 2D, lacked scale, and struggled with multi-step reasoning.
  • After: RoboTracer composes steps—referring, measuring, then tracing—in real units, using geometry when present and rewards that check progress.

Why it works (intuition):

  • The universal spatial encoder grants 3D awareness and flexibly uses whatever geometry you have.
  • The scale decoder forces the model to learn the size of the world, not just shapes.
  • Process rewards teach the model to do the right things in the right order, not just to get a lucky final answer.

🍞 Hook: Think of a library full of practice puzzles that get steadily harder and include the answer keys and how to solve them step by step.

đŸ„Ź The Concept (TraceSpatial dataset): TraceSpatial is a massive, diverse set (4.5M samples, 30M QA pairs) that teaches 3D referring, 3D measuring, and multi-step spatial tracing, with up to 9-step reasoning and lots of absolute-scale cases. How it works:

  1. Curates clean 2D/3D/video data with accurate geometry and object descriptions.
  2. Builds multi-step annotations (which object? what distance? what order?).
  3. Includes both object-centric and end-effector-centric traces across robots. Why it matters: Without the right data, models can’t learn metric-grounded, step-by-step reasoning. 🍞 Anchor: The dataset includes instructions like “pick the leftmost can, go around the cup, then place it 0.25 m to the right of the yellow box,” plus precise start masks, end 3D boxes, and intermediate clues.
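To picture what one training sample might look like, here is a hypothetical schema in the spirit of TraceSpatial; the field names, file references, and numbers are assumptions for illustration, not the released format.

```python
sample = {
    "image": "scene_000123.jpg",                  # RGB frame (hypothetical file)
    "intrinsics": [600.0, 600.0, 320.0, 240.0],   # fx, fy, cx, cy (optional)
    "depth": "scene_000123_depth.png",            # metric depth map (optional)
    "instruction": "Pick the leftmost can, go around the cup, "
                   "then place it 0.25 m to the right of the yellow box.",
    "reasoning_steps": [                          # supervises referring + measuring
        {"type": "refer",   "target": "leftmost can", "point_uvd": [143, 402, 0.81]},
        {"type": "refer",   "target": "yellow box",   "point_uvd": [388, 410, 0.84]},
        {"type": "measure", "quantity": "offset_right", "value_m": 0.25},
    ],
    "trace_uvd": [[143, 402, 0.81], [220, 350, 0.86], [310, 360, 0.85], [452, 415, 0.84]],
}
print(len(sample["trace_uvd"]), "waypoints,", len(sample["reasoning_steps"]), "reasoning steps")
```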

🍞 Hook: When building a house of cards, you check each layer as you go—not just at the end.

đŸ„Ź The Concept (Metric-sensitive process rewards in RFT): These are rewards during training that check intermediate steps (like whether you found the right object or measured the right height) using real-unit metrics. How it works:

  1. Define outcome rewards for final format and full-trajectory match.
  2. Add process rewards for step-wise referring and measuring accuracy.
  3. Optimize the model to maximize both, encouraging faithful reasoning. Why it matters: Without process rewards, the model may pass sometimes by luck but won’t consistently do the right steps. 🍞 Anchor: While solving “place the book above the highest stack,” the model gets points for correctly finding the highest stack first, not just for ending near the right place.
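A hedged sketch of how outcome and process rewards could be blended; the thresholds, weights, and error measures below are made up, and the paper’s actual reward design is richer (format checks, order-invariant matching, trajectory similarity).

```python
import math
from typing import Dict, Sequence

def outcome_reward(pred_trace: Sequence[Sequence[float]],
                   gt_trace: Sequence[Sequence[float]]) -> float:
    """Crude full-trajectory reward: 1 minus mean waypoint error (meters), clipped at 0."""
    if len(pred_trace) != len(gt_trace):
        return 0.0
    err = sum(math.dist(p, g) for p, g in zip(pred_trace, gt_trace)) / len(gt_trace)
    return max(0.0, 1.0 - err)

def process_reward(pred_steps: Dict[str, float], gt_steps: Dict[str, float],
                   tol_m: float = 0.05) -> float:
    """Fraction of intermediate metric quantities (heights, offsets) within tolerance."""
    hits = [abs(pred_steps.get(k, float("inf")) - v) <= tol_m for k, v in gt_steps.items()]
    return sum(hits) / max(1, len(hits))

def total_reward(pred_trace, gt_trace, pred_steps, gt_steps,
                 w_outcome: float = 0.5, w_process: float = 0.5) -> float:
    return (w_outcome * outcome_reward(pred_trace, gt_trace)
            + w_process * process_reward(pred_steps, gt_steps))

r = total_reward(
    pred_trace=[[0.10, 0.0, 0.45], [0.40, 0.0, 0.45]],
    gt_trace=[[0.10, 0.0, 0.46], [0.40, 0.0, 0.44]],
    pred_steps={"flower_height_m": 0.19, "hover_offset_m": 0.03},
    gt_steps={"flower_height_m": 0.20, "hover_offset_m": 0.03},
)
print(f"combined reward: {r:.3f}")
```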

03Methodology

High-level recipe: Input (image + instruction, optional geometry) → Universal Spatial Encoder + RGB Encoder → Language Model with Scale Decoder → Output (reasoned steps + 3D (u,v,d) spatial trace).
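The sketch below mirrors this dataflow with stand-in components so the example runs end to end; every function name and return value is hypothetical, not RoboTracer’s actual code.

```python
# Stand-in components (all hypothetical) so the pipeline sketch is runnable.
def rgb_encoder(image):                        return {"rgb": image}
def spatial_encoder(image, depth, intrinsics): return {"geo": (depth, intrinsics)}
def scale_decoder(hidden_state):               return 1.0  # toy: meters per model unit
def llm(instruction, rgb_tokens, geo_tokens):
    steps = ["refer: leftmost flower", "measure: height 0.20 m"]
    return {"steps": steps, "trace": [(176, 788, 0.945), (210, 760, 0.940)]}, "<SCALE>-state"

def robotracer_forward(image, instruction, depth=None, intrinsics=None):
    """Illustrative dataflow only: encode RGB (+ optional geometry), reason with the
    language model, decode a metric scale, and emit a (u, v, d) spatial trace."""
    rgb_tokens = rgb_encoder(image)
    geo_tokens = (spatial_encoder(image, depth, intrinsics)
                  if depth is not None or intrinsics is not None else None)
    reasoning, scale_state = llm(instruction, rgb_tokens, geo_tokens)
    scale = scale_decoder(scale_state)
    return reasoning["steps"], scale, reasoning["trace"]

steps, scale, trace = robotracer_forward("rgb.png", "water the flowers left to right")
print(steps, scale, trace[0])
```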

Step-by-step:

  1. Inputs: The model takes an RGB image and the instruction text. If available, it also reads geometric inputs, such as camera intrinsics and depth. Why it exists: Real geometry tightens depth and distance estimates. Without it, the model might misjudge heights or distances. Example: For “hover can 1–5 cm above each flower,” camera intrinsics and depth help measure the flowers’ heights in meters.

🍞 Hook: Imagine swapping between rulers, measuring tapes, and blueprints depending on what’s in your toolbox.

đŸ„Ź The Concept (Universal Spatial Encoder): A module that flexibly ingests optional geometric cues (depth, intrinsics, poses) to form better 3D features. How it works:

  1. Convert geometry to a consistent spatial representation.
  2. Fuse it with visual features to enhance 3D understanding.
  3. Pass it through a projector so the language model can “read” it. Why it matters: Without this encoder, the model can’t fully benefit from extra geometry when it’s available. 🍞 Anchor: With depth + intrinsics, the encoder helps tell that the “top shelf” is truly higher and farther, not just higher in pixels.
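One plausible way (an assumption on our part, not the paper’s architecture) to give visual features 3D awareness is to turn depth plus intrinsics into a per-pixel 3D coordinate map and fuse it with the RGB features; the sketch below uses NumPy just to show the shape bookkeeping.

```python
from typing import Optional, Tuple
import numpy as np

def pixel_coords_to_3d(depth: np.ndarray, fx: float, fy: float,
                       cx: float, cy: float) -> np.ndarray:
    """Turn an HxW metric depth map into an HxWx3 map of camera-frame 3D points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float32)
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def fuse_features(rgb_feats: np.ndarray, depth: Optional[np.ndarray],
                  intrinsics: Optional[Tuple[float, float, float, float]]) -> np.ndarray:
    """Concatenate a 3D coordinate map onto RGB features when geometry is available;
    otherwise pad with zeros so the downstream projector sees a fixed width."""
    h, w, _ = rgb_feats.shape
    if depth is not None and intrinsics is not None:
        xyz = pixel_coords_to_3d(depth, *intrinsics)
    else:
        xyz = np.zeros((h, w, 3), dtype=np.float32)
    return np.concatenate([rgb_feats, xyz], axis=-1)

rgb_feats = np.random.rand(240, 320, 8).astype(np.float32)  # toy feature map
depth = np.full((240, 320), 0.9, dtype=np.float32)          # toy flat 0.9 m depth
fused = fuse_features(rgb_feats, depth, (600.0, 600.0, 160.0, 120.0))
print(fused.shape)  # (240, 320, 11)
```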

🍞 Hook: Think of setting the scale on a map so that 1 cm equals 1 km—you need the right scale to measure correctly.

đŸ„Ź The Concept (Scale Decoder): A small head that predicts a metric scale factor tied to a special <SCALE> token, trained with a regression loss instead of just text predictions. How it works:

  1. The language model emits <SCALE>.
  2. The decoder turns it into a numeric scale factor.
  3. A regression loss pulls it toward ground-truth scale, improving real-world size awareness. Why it matters: Without explicit scale learning, the model can’t reliably distinguish 1 cm from 10 cm in a single image. 🍞 Anchor: When asked to “move the mug 0.25 m,” the scale decoder helps the model convert that into the correct number of pixels and depth.
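A minimal sketch, assuming a PyTorch-style setup, of what such a head might look like: it maps the hidden state of the special <SCALE> token to a single positive number and is trained with a plain regression loss against the ground-truth scene scale. The layer sizes and loss choice are assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class ScaleDecoder(nn.Module):
    """Map the hidden state of the <SCALE> token to one positive metric scale factor."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, scale_hidden: torch.Tensor) -> torch.Tensor:
        # softplus keeps the predicted scale strictly positive
        return nn.functional.softplus(self.head(scale_hidden)).squeeze(-1)

decoder = ScaleDecoder(hidden_dim=1024)
scale_hidden = torch.randn(4, 1024)             # batch of <SCALE> token states (toy)
gt_scale = torch.tensor([1.0, 0.8, 1.2, 0.95])  # ground-truth metric scales (toy)
pred_scale = decoder(scale_hidden)
loss = nn.functional.mse_loss(pred_scale, gt_scale)  # regression loss, not a text loss
loss.backward()
print(pred_scale.shape, float(loss))
```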

SFT (Supervised Fine-tuning) in two phases:

  • Metric Alignment: Train the spatial encoder projector and the scale decoder using geometry-rich parts of TraceSpatial. This teaches the model how to tie pixels to meters.
  • Metric Enhancement: Freeze the spatial encoder; fine-tune other parts on mixed RGB and RGB+geometry data (plus general instruction data). This preserves broad skills while strengthening metric reasoning. Why it exists: A careful schedule gives the model a strong 3D base and then broadens its language and VQA ability without losing scale sense. Example: After SFT, the model better answers “How tall is the left vase?” with realistic units.
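A hedged sketch of how this two-phase schedule could be expressed in PyTorch-style code, assuming hypothetical module names (spatial_encoder, spatial_projector, scale_decoder, llm); the exact set of frozen and trainable parameters in the paper may differ.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_phase(model: nn.Module, phase: str) -> None:
    """Phase 1 (metric alignment): train the spatial projector and scale decoder.
    Phase 2 (metric enhancement): freeze the spatial encoder, fine-tune the rest."""
    if phase == "metric_alignment":
        set_trainable(model, False)
        set_trainable(model.spatial_projector, True)
        set_trainable(model.scale_decoder, True)
    elif phase == "metric_enhancement":
        set_trainable(model, True)
        set_trainable(model.spatial_encoder, False)

class ToyRoboTracer(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial_encoder = nn.Linear(16, 16)
        self.spatial_projector = nn.Linear(16, 16)
        self.scale_decoder = nn.Linear(16, 1)
        self.llm = nn.Linear(16, 16)

model = ToyRoboTracer()
configure_phase(model, "metric_alignment")
print(sum(p.requires_grad for p in model.parameters()))  # 4: projector + decoder tensors
```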

🍞 Hook: Training a dog to fetch isn’t just about the final fetch—you reward each correct step, like finding the ball and bringing it back.

đŸ„Ź The Concept (RFT with metric-sensitive process rewards): A reinforcement stage that improves multi-step reasoning with both outcome and process feedback. How it works:

  1. Outcome rewards: check final format, start/end point match, and full-trajectory similarity.
  2. Process rewards: check each step’s referring and measuring correctness (order-invariant) using real metric errors.
  3. Optimize with GRPO to balance both reward types. Why it matters: Without process checks, the model might guess a final point but fail to generalize complex tasks. 🍞 Anchor: In “place the pillow on the stool to the left of the highest shelf,” the model first proves it found the stool and the highest shelf, then nails the final placement.
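A small sketch of the group-relative advantage idea behind GRPO, under the assumption that several candidate traces are sampled per instruction and scored with a combined outcome + process reward; the actual policy-gradient update, clipping, and KL terms are omitted.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each sampled trace's reward against
    the mean and std of its own group (all samples for the same prompt)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for 4 candidate traces sampled for one instruction (toy numbers),
# each already blending outcome and metric-sensitive process terms.
rewards = [0.92, 0.40, 0.75, 0.55]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```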

🍞 Hook: You know how decimals (like 0.945 m) plus pixels make a precise target on screen and in the real world?

đŸ„Ź The Concept (Decoupled (u,v,d) representation): Each waypoint is a triple: image location (u,v) and absolute depth d, which can be lifted to 3D via camera intrinsics. How it works:

  1. (u,v) grounds the point in the image.
  2. d anchors it in real-world distance.
  3. Drop d to get 2D traces; keep start/end only for referring tasks. Why it matters: This unifies 2D and 3D tasks and simplifies data reuse and co-training. 🍞 Anchor: The same flower-watering trace can be evaluated in 2D (pixels) or in 3D (meters) just by including or omitting d.

Example walk-through (the watering task):

  • Read: “Water flowers from left to right with the can hovering 1–5 cm above each.”
  • Refer: Find the flowers in left-to-right order (3D spatial referring).
  • Measure: Estimate each flower’s height and add 1–5 cm (3D spatial measuring + scale).
  • Trace: Output an ordered sequence of (u,v,d) points that follow the hover band, avoiding collisions.
  • Verify: Process rewards check you found the right flowers and the right heights before grading the final path.
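As a worked toy version of this walkthrough (all positions, heights, and the hover offset are made-up numbers, and the output is in plain metric (x, y, z) rather than (u,v,d)), the sketch below turns left-to-right flower positions plus measured heights into an ordered hover trace.

```python
from typing import List, Tuple

def watering_trace(flowers: List[dict], hover_m: float = 0.03) -> List[Tuple[float, float, float]]:
    """Visit flowers left to right; hover `hover_m` meters above each flower's top.
    Each flower dict holds a metric base position (x, y, z) and a height in meters,
    with x increasing rightward and z pointing up (assumed frame)."""
    ordered = sorted(flowers, key=lambda f: f["base"][0])  # left-to-right by x
    trace = []
    for f in ordered:
        x, y, z = f["base"]
        trace.append((x, y, z + f["height_m"] + hover_m))  # flower top plus hover band
    return trace

flowers = [
    {"base": (0.55, 0.20, 0.00), "height_m": 0.22},
    {"base": (0.15, 0.20, 0.00), "height_m": 0.18},
    {"base": (0.35, 0.20, 0.00), "height_m": 0.20},
]
for wp in watering_trace(flowers):
    print(tuple(round(c, 2) for c in wp))
# (0.15, 0.2, 0.21) -> (0.35, 0.2, 0.23) -> (0.55, 0.2, 0.25)
```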

04Experiments & Results

The tests (what they measured): The team evaluated spatial understanding (2D/3D relations), spatial measuring (depth, distances, sizes), 2D spatial referring, visual trace prediction, and full multi-step 3D spatial tracing with collision checks. They also tested general VLM skills so the model wouldn’t forget common sense.

The competition (who they compared against): Strong baselines like Gemini-2.5-Pro, Qwen-3-VL (4B/8B), NVILA (2B/8B), Molmo, RoboBrain 2.0, and more. These are well-known, capable VLMs.

Scoreboard with context:

  ‱ Spatial understanding and measuring: RoboTracer-8B-SFT, trained on TraceSpatial, reached an average of 85.7% across benchmarks and beat Gemini-2.5-Pro by 8.58 percentage points and NVILA-8B by 20.3 points, with especially large gains on 3D and measurement tasks (a 23.6% relative improvement, versus 14.7% on 2D tasks). That’s like jumping from a solid B to an A+ on the hardest questions.
  • Overall spatial success: Across multiple spatial tasks, RoboTracer averaged 79.1% success, ahead of strong baselines.
  • 2D spatial referring and visual trace: Despite being designed for 3D, RoboTracer achieved top scores here too. The (u,v,d) design let it reuse data and co-train seamlessly with 2D formats, lifting its 2D accuracy.
  • TraceSpatial-Bench (the hard new benchmark): Real indoor/tabletop images with careful geometry, start masks, end 3D boxes, and 3–8 reasoning steps. RoboTracer beat Gemini-2.5-Pro by 36 percentage points. It also improved further when given explicit geometry (intrinsics, depth), showing the universal spatial encoder pays off.
  • General VLM benchmarks: Joint training preserved (and sometimes improved) common sense and visual QA performance.
  • Real robots: In simulator and real-world tests (UR5 arm, G1 humanoid), RoboTracer executed long, multi-step tasks in cluttered, changing scenes. For example, it adapted when the “rightmost hamburger” changed mid-task and produced collision-free traces.

Surprising findings:

  • Adding precise geometry (when available) gave up to 6% absolute gains on hard tracing, suggesting that explicit, real measurements are still golden.
  • Regression supervision for scale decoding outperformed text-only supervision and no supervision—emphasizing the benefit of treating scale as a number to predict, not just a word to say.
  • Process rewards notably improved 3D tracing versus outcome-only rewards; 2D gains alone didn’t translate to reliable 3D without process guidance.

What it means: The combination—universal spatial encoder, explicit scale learning, and process-aware RFT—made the model more accurate, more reliable in steps, and safer on long-horizon tasks, not just better at final answers.

05Discussion & Limitations

Limitations:

  • If no geometric inputs are available and the scene is visually tricky (poor lighting, reflections, thin structures), absolute metric estimates can still drift.
  • The model is trained for spatial operations; tasks far outside spatial tracing (e.g., reading fine text, medical imaging) aren’t its focus.
  • Very dynamic scenes with fast motion blur, extreme occlusions, or moving cameras may stress the metric grounding.
  • Process rewards rely on step annotations; domains without such annotations may see smaller RFT gains.

Required resources:

  • For best results: RGB images plus any available geometry (camera intrinsics, depth), and a GPU to run the VLM.
  • For training: Access to large-scale data (TraceSpatial) and compute for SFT and RFT.

When not to use it:

  • If only a 2D overlay is needed (no 3D safety or metric precision), a simpler 2D tracer may suffice.
  • For tasks driven mostly by language or world knowledge (no spatial plan), a regular VLM is often enough.
  • If precise calibration is impossible and the scene is highly reflective or textureless, metric tracing may be unreliable.

Open questions:

  • How to make metric grounding robust under extreme lighting, motion blur, or moving cameras without extra sensors?
  • Can we reduce reliance on annotated step-wise rewards by learning process supervision from weaker signals (e.g., self-checks)?
  • How to fuse multi-view or short video clips efficiently to strengthen 3D consistency?
  • Can the same ideas scale to mobile robots navigating entire homes or warehouses with equal reliability?

Big picture: RoboTracer shows that explicit scale learning, geometry-plugging, and process-aware rewards push VLMs from “seeing” to “doing safely” in the real world.

06Conclusion & Future Work

Three-sentence summary: RoboTracer is a 3D-aware vision-language model that turns natural language into accurate, collision-free 3D paths by mastering 3D spatial referring and measuring. It does this with a universal spatial encoder, an explicitly trained scale decoder, and reinforcement fine-tuning that rewards correct intermediate steps, all powered by the large TraceSpatial dataset. The result is state-of-the-art spatial tracing across benchmarks and real robots, with especially strong gains in metric-grounded, multi-step reasoning.

Main achievement: Showing that combining geometry-aware encoding, explicit scale regression, and metric-sensitive process rewards converts language into reliable 3D action plans, closing the loop from instruction to safe motion.

Future directions:

  • Make metric grounding robust under motion, glare, and low light; add lightweight multi-view fusion.
  • Learn process supervision from weaker, cheaper signals; scale to mobile navigation and multi-robot coordination.
  • Expand TraceSpatial to cover more tools, terrains, and dynamic obstacles.

Why remember this: It’s a blueprint for teaching robots not just to understand instructions but to carry them out safely in the real world—step by step, in real units, and without collisions.

Practical Applications

  ‱ Home assistance: Place items precisely (e.g., “Put the remote 20 cm to the right of the TV box”) without bumping decorations.
  ‱ Kitchen help: Move containers along shelves while maintaining safe heights and avoiding collisions with cookware.
  ‱ Warehouse picking: Find the correct bin (third from the left), measure the right offset, and plan a clean 3D path.
  ‱ Hospital logistics: Deliver supplies to exact spots (e.g., 10 cm from the edge) in narrow, busy corridors.
  ‱ Retail restocking: Place products with measured spacing and consistent alignment along shelves.
  ‱ Assembly tasks: Stack components on the correct fixture and maintain clearance from nearby tools or parts.
  ‱ Cleaning tasks: Wipe or scan surfaces along metric-guided paths (e.g., sweep 30 cm strips) while avoiding obstacles.
  ‱ Education and labs: Demonstrate geometry-aware motions for STEM classes and robot training labs.
  ‱ Agriculture: Navigate rows (e.g., “third plant from the right”) and keep tools at safe heights above leaves.
  ‱ Humanoid assistance: Execute multi-step, collision-free tasks (e.g., watering plants left-to-right at 1–5 cm height).
#RoboTracer#spatial trace#3D spatial referring#3D spatial measuring#vision-language model#universal spatial encoder#scale decoder#(u,v,d) representation#reinforcement fine-tuning#process rewards#metric grounding#TraceSpatial dataset#TraceSpatial-Bench#robot manipulation#collision-free planning