RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Key Summary
- RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) that robots can follow.
- It focuses on two hard skills: finding the right things in 3D (referring) and measuring real-world distances and sizes (measuring).
- A universal spatial encoder lets the model plug in extra geometry (like depth or camera info) when available for better accuracy.
- A scale decoder learns the scene's real size (meters, centimeters) using a special regression loss, so the robot knows how big "5 cm" actually is in that photo.
- RoboTracer is trained first with supervised fine-tuning (SFT) and then with reinforcement fine-tuning (RFT) that uses metric-sensitive process rewards to guide multi-step reasoning.
- The new TraceSpatial dataset (4.5M samples, 30M QAs) teaches 3D referring, measuring, and tracing with up to 9-step reasoning.
- On tough tests, RoboTracer beats strong baselines: it reaches an average success rate of 79.1% on spatial tasks and outperforms Gemini-2.5-Pro by 36% on TraceSpatial-Bench.
- The model's decoupled (u,v,d) point format makes it easy to handle both 2D and 3D tasks and to co-train with existing 2D datasets.
- In real and simulated robots (UR5 arm, G1 humanoid), RoboTracer plans collision-free, multi-step motions in cluttered, changing scenes.
- This work narrows the gap between reading instructions and safely carrying them out in the physical world.
Why This Research Matters
Robots need to follow instructions safely in our messy, 3D world, not just answer questions about pictures. RoboTracer shows how to connect words to real, metric-aware motion plans that avoid collisions and obey distances like centimeters and meters. This unlocks reliable help in homes (tidying, setting tables), workplaces (packing, sorting), and hospitals (fetching items), even in crowded spaces. By using geometry when it's available and learning the scene's scale directly, the model is more accurate and trustworthy. The process rewards make the reasoning steps dependable, so the robot doesn't just guess the final spot. Overall, this brings us closer to robots that don't just "see," but actually do the right thing, step by step, in the real world.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're telling a friend, "Water the flowers from left to right, and keep the watering can just a few centimeters above each one." Your friend must (1) figure out which pots count as "left to right," (2) know what "a few centimeters" really is, and (3) move smoothly without bumping into things.
The Concept (Spatial Tracing, the big need): Spatial tracing is turning a language instruction into a safe, ordered list of 3D waypoints a robot can follow. How it works (in spirit):
- Read the instruction.
- Find the right objects in the scene and their order.
- Measure real-world distances and heights.
- Produce a 3D point-by-point path that follows the rule and avoids collisions.
Why it matters: Without spatial tracing, robots might try to follow words directly and hit things or stop in the wrong place.
Anchor: To "place the blue mug to the right of the red bowl by 10 cm," the robot must pick the right mug and bowl, know where "right" is, understand what 10 cm looks like in that camera view, and then move there safely.
Hook: You know how saying "the second cookie from the left" only makes sense if you can actually see the line of cookies and count them correctly?
The Concept (3D Spatial Referring): 3D spatial referring means pointing to the correct object in real, three-dimensional space by understanding relationships like left/right/front/back/top/bottom. How it works:
- Look at all visible objects.
- Understand their 3D arrangement and order.
- Select the object that matches the words (e.g., "the second flower from the left").
Why it matters: Without correct referring, the robot may grab the wrong flower or mix up ordering.
Anchor: If asked to "grab the third book from the top shelf," the robot must know which books are on the top shelf and then pick book number three.
Hook: Think about using a ruler in a photo. If you don't know how big the photo is compared to real life, your measurement won't be accurate.
The Concept (3D Spatial Measuring): 3D spatial measuring is estimating real distances, depths, sizes, and heights in the physical world from camera views. How it works:
- Learn the sceneâs scale (how pixels map to meters).
- Estimate depths and sizes of objects.
- Convert the plan into real-world units (cm, m).
Why it matters: Without measuring, "move 5 cm" could turn into "move way too far" or "hardly move at all."
Anchor: For "hover the watering can 1–5 cm above each flower," the robot must measure each flower's height and then add a small, real-world offset.
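To make the measuring problem concrete, here is a tiny pinhole-camera sketch (not from the paper; the focal length and depth values are made up) showing how many pixels the same 5 cm offset spans at different depths:

```python
# Illustrative only: how many pixels does a 5 cm offset cover at a given depth?
# Assumes a pinhole camera with a hypothetical focal length fx (in pixels).

def metric_to_pixels(offset_m: float, depth_m: float, fx: float = 600.0) -> float:
    """Approximate horizontal pixel span of a metric offset at a given depth."""
    return fx * offset_m / depth_m

for depth in (0.4, 1.0, 3.0):  # meters
    px = metric_to_pixels(0.05, depth)
    print(f"5 cm at {depth} m depth ~ {px:.1f} px")
# The same 5 cm shrinks from ~75 px up close to ~10 px far away, which is why
# a model must learn the scene's scale rather than rely on pixel patterns alone.
```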
The world before: Many vision-language models could answer 2D questions (like "Is the cat left of the dog?") or draw a 2D trace on an image. But they often:
- Ignored multi-step reasoning (finding several objects in sequence).
- Produced only 2D outputs (pixels), not true 3D points.
- Lacked real, absolute scale (so "10 cm" had no reliable meaning).
The problem: Robots need a multi-step, metric-grounded plan: find the right objects in 3D (referring), measure distances and heights (measuring), then make a safe 3D path (spatial trace). Existing methods rarely did all three together.
Failed attempts: Prior methods either stayed in 2D (so no absolute depth), tried to guess 3D without solid scale cues, or skipped supervising the key steps (which object? what distance?). They often made floating or colliding paths.
The gap: What was missing was a model that (1) can plug in extra geometry when available (like depth and camera intrinsics), (2) explicitly learns real-world scale, and (3) is rewarded not just for final answers but also for correct intermediate steps (like "you found the right flower" and "you measured the right height").
Real stakes: This matters for home helpers, warehouse bots, and humanoids. If a model understands "place the phone 20 cm to the right of the charger" or "stack the box on that shelf without hitting the vase," robots can actually do useful, safe tasks at home, in hospitals, and in factories.
02 Core Idea
Hook: You know how baking is easier if your recipe has exact amounts (cups, grams) and clear steps like "first preheat, then mix, then bake"? A robot needs the same clarity: what to find, how far to move, and in what order.
The Concept (RoboTracer, the main idea): RoboTracer is a 3D-aware vision-language model that turns instructions into accurate, metric-grounded 3D paths by supervising both the key steps (referring and measuring) and the final trace. How it works:
- Use a universal spatial encoder to read images and optionally extra geometry (depth, intrinsics) for precise 3D understanding.
- Use a scale decoder to predict the sceneâs real metric scale.
- Train with supervised fine-tuning (SFT) on a big dataset (TraceSpatial) to learn 3D referring and measuring.
- Improve with reinforcement fine-tuning (RFT) using metric-sensitive process rewards so the model gets feedback on each step, not just the ending.
Why it matters: Without these parts, robots miss steps, mix up objects, or misjudge distances, leading to wrong or unsafe paths.
Anchor: For "pick the rightmost orange, then place it 0.25 m to the right of the blue bowl," RoboTracer selects the right orange, converts 0.25 m into the scene's scale, and generates a collision-free 3D route to get there.
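As a mental model of the pipeline just described, here is a hypothetical input/output sketch in Python; the class, function, and field names are ours, not the authors' API, and the returned trace is a fixed dummy value:

```python
# Hypothetical interface sketch, for illustration only.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class TraceResult:
    reasoning_steps: List[str]                    # e.g. intermediate referring/measuring steps
    trace_uvd: List[Tuple[float, float, float]]   # ordered waypoints as (u, v, depth in meters)

def run_spatial_tracer(image: np.ndarray,
                       instruction: str,
                       depth: Optional[np.ndarray] = None,
                       intrinsics: Optional[np.ndarray] = None) -> TraceResult:
    """Stand-in for a spatial-tracing VLM: image + instruction in, reasoning + (u,v,d) trace out."""
    # A real system would run the model here; we return a fixed dummy trace instead.
    return TraceResult(
        reasoning_steps=["<object referring step>", "<metric measuring step>"],
        trace_uvd=[(176.0, 788.0, 0.945), (180.0, 700.0, 0.930)],
    )
```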
Three analogies for the same idea:
- Recipe analogy: The universal spatial encoder is like knowing how to use any oven (extra geometry) if it's available; the scale decoder is your measuring cup; RFT process rewards are the taste tests at each step.
- Map-and-compass: The encoder is your map that accepts landmarks (depth, intrinsics), the scale decoder is your distance scale bar, and process rewards are checkpoint stamps proving you didn't get lost.
- Orchestra: The encoder brings in instruments (geometry) when present, the scale decoder keeps everyone on tempo (real units), and process rewards are the conductor's mid-performance notes.
Hook: Imagine you're labeling points on a photo: where to start, where to go next, and how deep into the scene to move.
The Concept ((u,v,d) point format): Represent each waypoint as image x-position (u), y-position (v), and absolute depth (d in meters). How it works:
- Use (u,v) to locate the pixel.
- Add d to anchor it to the real-world depth.
- Convert to 3D using camera intrinsics if available.
Why it matters: It simplifies learning, works with 2D-only data by dropping d, and cleanly upgrades to full 3D when geometry is present.
Anchor: A 5-point trace might be [(176, 788, 0.945), ...], which the robot converts into a 3D flight path above flowers.
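Lifting a (u, v, d) waypoint into 3D camera coordinates is standard pinhole back-projection; here is a minimal sketch with made-up intrinsics:

```python
import numpy as np

def lift_uvd_to_xyz(trace_uvd, fx, fy, cx, cy):
    """Back-project (u, v, depth) waypoints to 3D points in the camera frame
    using the pinhole model: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d."""
    points = []
    for u, v, d in trace_uvd:
        x = (u - cx) * d / fx
        y = (v - cy) * d / fy
        points.append((x, y, d))
    return np.array(points)

# Example with made-up intrinsics for a 1280x960 image.
trace = [(176, 788, 0.945), (420, 760, 0.930), (660, 775, 0.940)]
xyz = lift_uvd_to_xyz(trace, fx=900.0, fy=900.0, cx=640.0, cy=480.0)
print(xyz)  # metric 3D waypoints the robot can follow
```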
Before vs. after:
- Before: Models guessed in 2D, lacked scale, and struggled with multi-step reasoning.
- After: RoboTracer composes steps (referring, measuring, then tracing) in real units, using geometry when present and rewards that check progress.
Why it works (intuition):
- The universal spatial encoder grants 3D awareness and flexibly uses whatever geometry you have.
- The scale decoder forces the model to learn the size of the world, not just shapes.
- Process rewards teach the model to do the right things in the right order, not just to get a lucky final answer.
Hook: Think of a library full of practice puzzles that get steadily harder and include the answer keys and how to solve them step by step.
The Concept (TraceSpatial dataset): TraceSpatial is a massive, diverse set (4.5M samples, 30M QA pairs) that teaches 3D referring, 3D measuring, and multi-step spatial tracing, with up to 9-step reasoning and lots of absolute-scale cases. How it works:
- Curates clean 2D/3D/video data with accurate geometry and object descriptions.
- Builds multi-step annotations (which object? what distance? what order?).
- Includes both object-centric and end-effector-centric traces across robots.
Why it matters: Without the right data, models can't learn metric-grounded, step-by-step reasoning.
Anchor: The dataset includes instructions like "pick the leftmost can, go around the cup, then place it 0.25 m to the right of the yellow box," plus precise start masks, end 3D boxes, and intermediate clues.
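To make the annotation types concrete, here is a purely hypothetical sketch of what one TraceSpatial-style record could contain; the field names and values are ours, not the dataset's actual schema:

```python
# Hypothetical record layout, for illustration only.
sample = {
    "instruction": "Pick the leftmost can, go around the cup, then place it "
                   "0.25 m to the right of the yellow box.",
    "image": "scene_000123.jpg",
    "intrinsics": [[900.0, 0.0, 640.0], [0.0, 900.0, 480.0], [0.0, 0.0, 1.0]],
    "reasoning_steps": [
        {"type": "refer",   "target": "leftmost can"},
        {"type": "refer",   "target": "yellow box"},
        {"type": "measure", "quantity": "offset", "value_m": 0.25},
    ],
    "start_mask": "masks/scene_000123_can.png",            # where the motion begins
    "end_box_3d": [0.42, -0.10, 0.31, 0.08, 0.08, 0.12],   # goal region in meters
    "trace_uvd": [(176, 788, 0.945), (410, 730, 0.910), (655, 770, 0.935)],
}
```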
Hook: When building a house of cards, you check each layer as you go, not just at the end.
The Concept (Metric-sensitive process rewards in RFT): These are rewards during training that check intermediate steps (like whether you found the right object or measured the right height) using real-unit metrics. How it works:
- Define outcome rewards for final format and full-trajectory match.
- Add process rewards for step-wise referring and measuring accuracy.
- Optimize the model to maximize both, encouraging faithful reasoning.
Why it matters: Without process rewards, the model may pass sometimes by luck but won't consistently do the right steps.
Anchor: While solving "place the book above the highest stack," the model gets points for correctly finding the highest stack first, not just for ending near the right place.
03 Methodology
High-level recipe: Input (image + instruction, optional geometry) → Universal Spatial Encoder + RGB Encoder → Language Model with Scale Decoder → Output (reasoned steps + 3D (u,v,d) spatial trace).
Step-by-step:
- Inputs: The model takes an RGB image and the instruction text. If available, it also reads geometric inputs, such as camera intrinsics and depth.
Why it exists: Real geometry tightens depth and distance estimates. Without it, the model might misjudge heights or distances.
Example: For "hover the can 1–5 cm above each flower," camera intrinsics and depth help measure the flowers' heights in meters.
Hook: Imagine swapping between rulers, measuring tapes, and blueprints depending on what's in your toolbox.
The Concept (Universal Spatial Encoder): A module that flexibly ingests optional geometric cues (depth, intrinsics, poses) to form better 3D features. How it works:
- Convert geometry to a consistent spatial representation.
- Fuse it with visual features to enhance 3D understanding.
- Pass it through a projector so the language model can "read" it.
Why it matters: Without this encoder, the model can't fully benefit from extra geometry when it's available.
Anchor: With depth + intrinsics, the encoder helps tell that the "top shelf" is truly higher and farther, not just higher in pixels.
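A heavily simplified toy module (our own PyTorch sketch, not the paper's architecture) illustrating the key property: geometry is fused in only when it is provided, and the result passes through a projector the language model can read:

```python
import torch
import torch.nn as nn

class ToySpatialEncoder(nn.Module):
    """Toy illustration: fuse RGB token features with optional geometric features."""
    def __init__(self, rgb_dim=768, geo_dim=64, out_dim=768):
        super().__init__()
        self.geo_proj = nn.Sequential(nn.Linear(1, geo_dim), nn.GELU(),
                                      nn.Linear(geo_dim, rgb_dim))
        self.fuse = nn.Linear(rgb_dim, out_dim)   # projector the language model "reads"

    def forward(self, rgb_tokens, depth_tokens=None):
        # rgb_tokens: (B, N, rgb_dim); depth_tokens: (B, N, 1) per-token depth, optional
        feats = rgb_tokens
        if depth_tokens is not None:              # plug in geometry only when available
            feats = feats + self.geo_proj(depth_tokens)
        return self.fuse(feats)

enc = ToySpatialEncoder()
rgb = torch.randn(2, 196, 768)
out_rgb_only = enc(rgb)                           # still works without geometry
out_with_geo = enc(rgb, torch.rand(2, 196, 1))    # tighter 3D cues when depth is given
```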
Hook: Think of setting the scale on a map so that 1 cm equals 1 km; you need the right scale to measure correctly.
The Concept (Scale Decoder): A small head that predicts a metric scale factor tied to a special <SCALE> token, trained with a regression loss instead of just text predictions. How it works:
- The language model emits <SCALE>.
- The decoder turns it into a numeric scale factor.
- A regression loss pulls it toward ground-truth scale, improving real-world size awareness.
Why it matters: Without explicit scale learning, the model can't reliably distinguish 1 cm from 10 cm in a single image.
Anchor: When asked to "move the mug 0.25 m," the scale decoder helps the model convert that into the correct number of pixels and depth.
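A minimal toy sketch of the scale-decoder idea (the head design, loss choice, and scale values below are our assumptions, not the paper's exact setup): regress a number from the <SCALE> token's hidden state and supervise it with a regression loss rather than next-token prediction:

```python
import torch
import torch.nn as nn

class ToyScaleDecoder(nn.Module):
    """Toy regression head on the <SCALE> token's hidden state."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.GELU(),
                                  nn.Linear(256, 1))

    def forward(self, scale_token_hidden):                 # (B, hidden_dim)
        return self.head(scale_token_hidden).squeeze(-1)   # predicted metric scale factor

decoder = ToyScaleDecoder()
hidden = torch.randn(4, 1024)                      # hidden states at <SCALE> positions
pred_scale = decoder(hidden)
gt_scale = torch.tensor([1.02, 0.87, 1.15, 0.95])  # ground-truth scene scales (assumed values)
loss = nn.SmoothL1Loss()(pred_scale, gt_scale)     # regression supervision, not text supervision
loss.backward()
```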
SFT (Supervised Fine-tuning) in two phases:
- Metric Alignment: Train the spatial encoder projector and the scale decoder using geometry-rich parts of TraceSpatial. This teaches the model how to tie pixels to meters.
- Metric Enhancement: Freeze the spatial encoder; fine-tune other parts on mixed RGB and RGB+geometry data (plus general instruction data). This preserves broad skills while strengthening metric reasoning.
Why it exists: A careful schedule gives the model a strong 3D base and then broadens its language and VQA ability without losing its sense of scale.
Example: After SFT, the model gives better answers to questions like "How tall is the left vase?", with realistic units.
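The freeze-then-finetune schedule can be sketched in a few lines of PyTorch; the module names here are hypothetical stand-ins, and which parts are trainable in each phase follows the summary above:

```python
import torch.nn as nn

# Hypothetical stand-in modules; the real model's components are not public here.
model = nn.ModuleDict({
    "spatial_encoder":   nn.Linear(64, 64),
    "spatial_projector": nn.Linear(64, 64),
    "scale_decoder":     nn.Linear(64, 1),
    "language_model":    nn.Linear(64, 64),
})

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Phase 1 (Metric Alignment): tie pixels to meters by training projector + scale decoder.
for name in model:
    set_trainable(model[name], name in {"spatial_projector", "scale_decoder"})

# Phase 2 (Metric Enhancement): freeze the spatial encoder, fine-tune the rest
# on mixed RGB / RGB+geometry data plus general instruction data.
for name in model:
    set_trainable(model[name], name != "spatial_encoder")
```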
Hook: Training a dog to fetch isn't just about the final fetch; you reward each correct step, like finding the ball and bringing it back.
The Concept (RFT with metric-sensitive process rewards): A reinforcement stage that improves multi-step reasoning with both outcome and process feedback. How it works:
- Outcome rewards: check final format, start/end point match, and full-trajectory similarity.
- Process rewards: check each stepâs referring and measuring correctness (order-invariant) using real metric errors.
- Optimize with GRPO to balance both reward types.
Why it matters: Without process checks, the model might guess a final point correctly but fail to generalize to complex tasks.
Anchor: In "place the pillow on the stool to the left of the highest shelf," the model first proves it found the stool and the highest shelf, then nails the final placement.
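A toy sketch of how outcome and process rewards might be combined (our own formulation; the exact reward terms, weights, and matching rule in the paper may differ). The process term matches predicted step points to ground-truth steps in an order-invariant way and scores them by metric error:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trajectory_reward(pred, gt, tol=0.05):
    """Outcome reward: closeness of corresponding waypoints (meters)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    n = min(len(pred), len(gt))
    err = np.linalg.norm(pred[:n] - gt[:n], axis=1).mean()
    return float(np.exp(-err / tol))              # 1.0 when perfect, decays with error

def process_reward(pred_steps, gt_steps, tol=0.05):
    """Process reward: order-invariant matching of intermediate step points."""
    pred, gt = np.asarray(pred_steps, float), np.asarray(gt_steps, float)
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # pairwise metric error
    rows, cols = linear_sum_assignment(cost)                           # best one-to-one match
    return float(np.exp(-cost[rows, cols].mean() / tol))

def total_reward(pred_trace, gt_trace, pred_steps, gt_steps,
                 format_ok=True, w_out=0.5, w_proc=0.4, w_fmt=0.1):
    return (w_fmt * float(format_ok)
            + w_out * trajectory_reward(pred_trace, gt_trace)
            + w_proc * process_reward(pred_steps, gt_steps))
```

In GRPO-style training, each sampled response would receive such a scalar reward, and advantages would be computed relative to the other samples in the same group.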
Hook: You know how decimals (like 0.945 m) plus pixels make a precise target on screen and in the real world?
The Concept (Decoupled (u,v,d) representation): Each waypoint is a triple: image location (u,v) and absolute depth d, which can be lifted to 3D via camera intrinsics. How it works:
- (u,v) grounds the point in the image.
- d anchors it in real-world distance.
- Drop d to get 2D traces; keep start/end only for referring tasks.
Why it matters: This unifies 2D and 3D tasks and simplifies data reuse and co-training.
Anchor: The same flower-watering trace can be evaluated in 2D (pixels) or in 3D (meters) just by including or omitting d.
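Continuing the pinhole sketch from earlier, the same trace can be scored in pixels (omit d) or in meters (lift with intrinsics); the waypoints and intrinsics below are illustrative, not from the paper:

```python
import numpy as np

def uvd_to_xyz(uvd, fx=900.0, fy=900.0, cx=640.0, cy=480.0):
    """Pinhole back-projection of (u, v, d) waypoints to metric camera coordinates."""
    uvd = np.asarray(uvd, float)
    u, v, d = uvd[:, 0], uvd[:, 1], uvd[:, 2]
    return np.stack([(u - cx) * d / fx, (v - cy) * d / fy, d], axis=1)

gt   = [(176, 788, 0.945), (420, 760, 0.930), (660, 775, 0.940)]
pred = [(180, 790, 0.950), (415, 756, 0.925), (668, 780, 0.952)]

err_2d = np.linalg.norm(np.asarray(pred, float)[:, :2] - np.asarray(gt, float)[:, :2], axis=1)
err_3d = np.linalg.norm(uvd_to_xyz(pred) - uvd_to_xyz(gt), axis=1)
print("mean 2D error:", err_2d.mean(), "px")   # pixels: just omit d
print("mean 3D error:", err_3d.mean(), "m")    # meters: include d and intrinsics
```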
Example walk-through (the watering task):
- Read: "Water flowers from left to right with the can hovering 1–5 cm above each."
- Refer: Find the flowers in left-to-right order (3D spatial referring).
- Measure: Estimate each flower's height and add 1–5 cm (3D spatial measuring + scale).
- Trace: Output an ordered sequence of (u,v,d) points that follow the hover band, avoiding collisions.
- Verify: Process rewards check you found the right flowers and the right heights before grading the final path.
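As a rough numeric sketch of this walk-through (the camera values, hover offset, and level-camera approximation are all our assumptions), hover waypoints can be placed a few centimeters above each flower top, ordered left to right:

```python
def hover_waypoints(flower_tops_uvd, hover_m=0.03, fy=900.0):
    """Toy sketch: place a waypoint hover_m above each flower top.
    Assumes a roughly level camera, so 'up' in the world maps to a smaller v,
    shifted by about fy * hover_m / depth pixels; depth stays nearly unchanged."""
    tops = sorted(flower_tops_uvd, key=lambda p: p[0])   # left-to-right order by u
    trace = []
    for u, v, d in tops:
        dv = fy * hover_m / d                            # metric hover offset converted to pixels
        trace.append((u, v - dv, d))
    return trace

flowers = [(640, 700, 0.95), (180, 720, 0.90), (1020, 710, 1.00)]  # made-up flower tops (u, v, depth)
print(hover_waypoints(flowers))
```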
04 Experiments & Results
The tests (what they measured): The team evaluated spatial understanding (2D/3D relations), spatial measuring (depth, distances, sizes), 2D spatial referring, visual trace prediction, and full multi-step 3D spatial tracing with collision checks. They also tested general VLM skills so the model wouldn't forget common sense.
The competition (who they compared against): Strong baselines like Gemini-2.5-Pro, Qwen-3-VL (4B/8B), NVILA (2B/8B), Molmo, RoboBrain 2.0, and more. These are well-known, capable VLMs.
Scoreboard with context:
- Spatial understanding and measuring: RoboTracer-8B-SFT, trained on TraceSpatial, reached an average of 85.7% across benchmarks and beat Gemini-2.5-Pro by 8.58 percentage points and NVILA-8B by 20.3 points, with especially large gains on 3D and measurement tasks (23.6% relative improvement vs. 14.7% on 2D tasks). That's like jumping from a solid B to an A+ on the hardest questions.
- Overall spatial success: Across multiple spatial tasks, RoboTracer averaged 79.1% success, ahead of strong baselines.
- 2D spatial referring and visual trace: Despite being designed for 3D, RoboTracer achieved top scores here too. The (u,v,d) design let it reuse data and co-train seamlessly with 2D formats, lifting its 2D accuracy.
- TraceSpatial-Bench (the hard new benchmark): Real indoor/tabletop images with careful geometry, start masks, end 3D boxes, and 3–8 reasoning steps. RoboTracer beat Gemini-2.5-Pro by 36 percentage points. It also improved further when given explicit geometry (intrinsics, depth), showing the universal spatial encoder pays off.
- General VLM benchmarks: Joint training preserved (and sometimes improved) common sense and visual QA performance.
- Real robots: In simulated and real-world tests (UR5 arm, G1 humanoid), RoboTracer executed long, multi-step tasks in cluttered, changing scenes. For example, it adapted when the "rightmost hamburger" changed mid-task and produced collision-free traces.
Surprising findings:
- Adding precise geometry (when available) gave up to 6% absolute gains on hard tracing, suggesting that explicit, real measurements are still golden.
- Regression supervision for scale decoding outperformed text-only supervision and no supervision, emphasizing the benefit of treating scale as a number to predict, not just a word to say.
- Process rewards notably improved 3D tracing versus outcome-only rewards; 2D gains alone didn't translate to reliable 3D without process guidance.
What it means: The combination of the universal spatial encoder, explicit scale learning, and process-aware RFT made the model more accurate, more reliable in its steps, and safer on long-horizon tasks, not just better at final answers.
05 Discussion & Limitations
Limitations:
- If no geometric inputs are available and the scene is visually tricky (poor lighting, reflections, thin structures), absolute metric estimates can still drift.
- The model is trained for spatial operations; tasks far outside spatial tracing (e.g., reading fine text, medical imaging) aren't its focus.
- Very dynamic scenes with fast motion blur, extreme occlusions, or moving cameras may stress the metric grounding.
- Process rewards rely on step annotations; domains without such annotations may see smaller RFT gains.
Required resources:
- For best results: RGB images plus any available geometry (camera intrinsics, depth), and a GPU to run the VLM.
- For training: Access to large-scale data (TraceSpatial) and compute for SFT and RFT.
When not to use it:
- If only a 2D overlay is needed (no 3D safety or metric precision), a simpler 2D tracer may suffice.
- For tasks driven mostly by language or world knowledge (no spatial plan), a regular VLM is often enough.
- If precise calibration is impossible and the scene is highly reflective or textureless, metric tracing may be unreliable.
Open questions:
- How to make metric grounding robust under extreme lighting, motion blur, or moving cameras without extra sensors?
- Can we reduce reliance on annotated step-wise rewards by learning process supervision from weaker signals (e.g., self-checks)?
- How to fuse multi-view or short video clips efficiently to strengthen 3D consistency?
- Can the same ideas scale to mobile robots navigating entire homes or warehouses with equal reliability?
Big picture: RoboTracer shows that explicit scale learning, plugging in geometry when available, and process-aware rewards push VLMs from "seeing" to "doing safely" in the real world.
06 Conclusion & Future Work
Three-sentence summary: RoboTracer is a 3D-aware vision-language model that turns natural language into accurate, collision-free 3D paths by mastering 3D spatial referring and measuring. It does this with a universal spatial encoder, an explicitly trained scale decoder, and reinforcement fine-tuning that rewards correct intermediate steps, all powered by the large TraceSpatial dataset. The result is state-of-the-art spatial tracing across benchmarks and real robots, with especially strong gains in metric-grounded, multi-step reasoning.
Main achievement: Showing that combining geometry-aware encoding, explicit scale regression, and metric-sensitive process rewards converts language into reliable 3D action plans, closing the loop from instruction to safe motion.
Future directions:
- Make metric grounding robust under motion, glare, and low light; add lightweight multi-view fusion.
- Learn process supervision from weaker, cheaper signals; scale to mobile navigation and multi-robot coordination.
- Expand TraceSpatial to cover more tools, terrains, and dynamic obstacles.
Why remember this: It's a blueprint for teaching robots not just to understand instructions but to carry them out safely in the real world, step by step, in real units, and without collisions.
Practical Applications
- Home assistance: Place items precisely (e.g., "Put the remote 20 cm to the right of the TV box") without bumping decorations.
- Kitchen help: Move containers along shelves while maintaining safe heights and avoiding collisions with cookware.
- Warehouse picking: Find the correct bin (third from the left), measure the right offset, and plan a clean 3D path.
- Hospital logistics: Deliver supplies to exact spots (e.g., 10 cm from the edge) in narrow, busy corridors.
- Retail restocking: Place products with measured spacing and consistent alignment along shelves.
- Assembly tasks: Stack components on the correct fixture and maintain clearance from nearby tools or parts.
- Cleaning tasks: Wipe or scan surfaces along metric-guided paths (e.g., sweep 30 cm strips) while avoiding obstacles.
- Education and labs: Demonstrate geometry-aware motions for STEM classes and robot training labs.
- Agriculture: Navigate rows (e.g., "third plant from the right") and keep tools at safe heights above leaves.
- Humanoid assistance: Execute multi-step, collision-free tasks (e.g., watering plants left-to-right at 1–5 cm height).