VLS: Steering Pretrained Robot Policies via Vision-Language Models
Key Summary
- Robots often learn good hand motions during training but get confused when the scene or the instructions change at test time, even a little bit.
- This paper introduces Vision-Language Steering (VLS), which adapts a frozen robot policy on the fly without any retraining.
- VLS asks a vision-language model (VLM) to understand the scene and the instruction, pick important 3D keypoints, and write tiny differentiable reward functions that score action proposals.
- These rewards gently push a diffusion or flow-matching policy during denoising, so the sampled actions fit the new scene and task.
- To stay both smart and flexible, VLS mixes three tricks: gradient guidance (precise nudges), Feynman-Kac resampling (keep the good candidates), and repulsive forces (keep candidates diverse).
- VLS also runs in closed loop: it adjusts guidance strength automatically and switches task stages with a Schmitt-trigger rule to avoid flip-flopping.
- Across benchmarks, VLS beats prior steering methods, improving success by up to 31% on CALVIN and 13% on LIBERO-PRO.
- On a real Franka robot, VLS keeps working under new object looks, new positions, and new goals, with no policy finetuning needed.
- Main tradeoff: better performance needs more sampling, which adds latency; still, it's practical and training-free.
- Big idea: don't reteach the robot; steer the skills it already has to fit the new situation.
Why This Research Matters
Homes, hospitals, and factories change every day: objects move, new tools appear, and instructions vary. VLS lets a robot handle these changes on the spot by steering what it already knows, instead of sending it back for retraining. This makes robots faster to deploy, safer around people and clutter, and cheaper to maintain. Because rewards are written by a VLM from live camera views and language, the system can flex to new scenes and goals without new datasets. The training-free nature also lowers the barrier for smaller labs and companies to achieve robust behavior. In short, VLS helps turn reliable, adaptable robot help from a research demo into an everyday reality.
Detailed Explanation
01 Background & Problem Definition
You know how: Once you learn to place a cup in the middle of a table, you can also place it near the edge or inside a cabinet without learning from scratch. Your hands already know how to move; you just adjust for the new spot.
What the world was like before: Robots learned from lots of examples (called imitation learning). This works great when the test looks like the training: same table, same object spot, same instruction. Recent robot policies based on diffusion or flow matching are especially good at this. But small changes at test time, like the target moving a bit, extra clutter, or different wording in the instruction, can make the robot fail. It is not because the robot lacks basic motor skills; it is because those skills are tied to the training setup.
The problem: Out-of-distribution (OOD) situations. The robot faces a picture-instruction pair it never saw before. The old way is to retrain or finetune. That is expensive and odd, because the needed motions are already in the robot's head; the robot just can't pick and shape them for the new scene.
Failed attempts: People tried three main things.
- Re-optimize with VLMs at test time: build a new plan by rolling out simulations or searching. Powerful, but slow and not great for real-time control.
- Use critics or dynamics models to guide sampling: helpful, but you end up optimizing toward the critic/model, which can push the robot off what the base policy knows best.
- Filter or select with VLMs or humans: you can pick among candidates, but the feedback is sparse (approve/reject), so it's sample-inefficient and struggles with fine, continuous corrections.
The gap: We need a way to adapt during inference that (1) keeps the base policy frozen, (2) understands open-world scenes and language, and (3) gives smooth, dense nudges to the action sampling, like a GPS for the robot's already-learned skills.
Real stakes: In homes and factories, objects move, new items appear, and instructions vary. A robot that needs retraining each time is impractical. A robot that can adjust on the fly is safer (avoids bumping into new clutter), faster to deploy (no retraining), and more useful (handles new tasks and layouts). That's what this paper delivers: Vision-Language Steering (VLS), a way to steer what the robot already knows, right when it acts.
New concept 1: Out-of-Distribution (OOD)
Hook: Imagine practicing basketball in one gym, then playing in another with different lighting and floor lines. Your moves are fine; the setting changed.
The Concept: OOD means the test scene or instruction doesn't match the training examples.
- How it works: 1) The robot sees a new layout or wording; 2) Its policy expects familiar patterns; 3) Small mismatches snowball; 4) Actions miss their mark.
- Why it matters: Without handling OOD, even good policies stumble when anything shifts.
Anchor: The robot learned "put the block in the center." Now the center is occupied, and the edge is requested; it hesitates or collides.
02 Core Idea
The "Aha!" in one sentence: Don't reteach the robot; use a VLM to write small, differentiable rewards that gently steer the frozen policy's sampling so its actions fit the new scene and task.
Three analogies for the same idea:
- GPS for skills: The robot already knows how to drive; VLS is the GPS that says "turn a little left now" to match the new road.
- Music conductor: The orchestra (policy) knows all the notes; the conductor (VLS) cues dynamics and timing to suit today's hall and audience.
- Editing clay: The base policy molds a shape; VLS adds tiny pushes to refine it into the needed figure for this specific scene.
Before vs. After:
- Before: Frozen policies sampled actions guided only by what they learned, so they stuck to training-era spatial assumptions and instructions.
- After: The same policies get live, scene-aware nudges from rewards that a VLM composes, so samples bend toward success in new layouts and goals, without any weight updates.
Why it works (intuition):
- Diffusion/flow policies generate actions by "denoising" noise into trajectories. If you can score a partial trajectory and say "better if closer to keypoint X and aligned with Y," you can push the next denoising step in a helpful direction. Do this each step, and the whole trajectory drifts toward satisfying the current scene and instruction.
- VLMs are good at open-world perception and language. If we ask them to reduce the scene to a small set of keypoints and write differentiable scoring rules (rewards), we can compute gradients: tiny arrows showing how to adjust actions to improve the score.
- Mixing gradient nudges (precise, local) with resampling and diversity (broad, global) balances stability and exploration.
Building blocks (with sandwich explanations):
New concept 2: Vision-Language Models (VLMs)
Hook: You know how you can look at a picture and read a sentence about it and instantly connect the two?
The Concept: A VLM understands images and words together.
- How it works: 1) See the scene; 2) Read the instruction; 3) Identify relevant objects and relations; 4) Output structured hints (like keypoints or rules).
- Why it matters: We need open-world understanding to tell the robot what matters right now.
Anchor: "Put the mug on the green plate." The VLM finds the mug, the green plate, and their positions.
New concept 3: Spatial Keypoints (Grounding)
Hook: When you aim to hang a picture, you mark the wall spot with a small dot.
The Concept: Keypoints are a compact set of 3D spots that capture what's important for the task.
- How it works: 1) Detect/segment objects; 2) Extract features; 3) Project to 3D; 4) Cluster to pick a few meaningful points.
- Why it matters: Keypoints shrink the problem so rewards can focus on the right geometry.
Anchor: For "place mug on plate," keypoints might be the mug handle tip, mug base center, and plate center.
New concept 4: Reward Functions (Differentiable)
Hook: Think of a video game score that goes up when you're closer to the goal.
The Concept: A reward function gives a number showing how well an action proposal fits the task, and it is smooth enough to differentiate.
- How it works: 1) VLM decomposes task into stages; 2) Writes small PyTorch functions that softly reward desired distances/alignments; 3) We compute gradients w.r.t. the action.
- Why it matters: Gradients tell us which way to tweak the action during denoising.
Anchor: Reward = higher when the mug base is above the plate center and the gripper is level.
New concept 5: Denoising Diffusion
Hook: Imagine a blurry photo becoming clear a little bit at a time.
The Concept: Diffusion starts from noise and gradually "cleans" it into a sample, in our case an action trajectory.
- How it works: 1) Start with random actions; 2) Predict a denoising step; 3) Repeat many times; 4) End with a plausible action sequence.
- Why it matters: If we can nudge each cleaning step, we can shape the final actions.
Anchor: Each denoising step moves the gripper path a bit closer to the plate center.
New concept 6: Flow Matching
Hook: Think of a river guiding leaves downstream along a smooth path.
The Concept: Flow matching learns a velocity field that smoothly moves noise to a clean sample.
- How it works: 1) Define a continuous time; 2) Learn the flow that carries samples; 3) Integrate along time to get an action.
- Why it matters: We can add a guidance vector to the flow to steer where the actions go.
Anchor: The velocity field points the gripper smoothly toward the plate.
New concept 7: Gradient-Based Steering
Hook: Like a coach whispering, "a bit more to the left."
The Concept: Use the reward's gradient to slightly adjust the denoising/flow step.
- How it works: 1) Score the current action proposal; 2) Compute the gradient; 3) Add the scaled gradient to the model's step; 4) Iterate.
- Why it matters: Tiny, continuous nudges produce big, reliable alignment over time.
Anchor: The gradient says "reduce distance to plate center," so the next sample step moves that way.
New concept 8: Feynman-Kac Resampling
Hook: When baking cookies, you keep the best-shaped ones and redo the bad ones.
The Concept: Keep a batch of action proposals, weight them by reward, and resample so good ones multiply.
- How it works: 1) Compute reward-based weights; 2) Probabilistically copy high-reward proposals; 3) Drop low-reward ones; 4) Continue denoising.
- Why it matters: Prevents wasting time on bad directions and boosts promising ones.
Anchor: Good mug-to-plate trajectories get more copies; awkward ones get replaced.
New concept 9: Repulsive Forces (Diversity)
Hook: Friends spread out at a picnic so everyone has room.
The Concept: Add a gentle "spread out" force so multiple proposals don't collapse to the same idea too soon.
- How it works: 1) Measure pairwise closeness; 2) Push candidates apart early; 3) Keep coverage; 4) Later focus with rewards.
- Why it matters: Diversity finds better solutions in tricky scenes.
Anchor: Some candidates try over the plate center, others try near the edge; then the best wins via rewards.
New concept 10: Adaptive Guidance Strength
Hook: Like turning the steering wheel more when you're far off course and less when you're nearly straight.
The Concept: Automatically scale how strong the gradient nudges are based on recent reward.
- How it works: 1) Compare current reward to a baseline; 2) If worse, increase guidance; 3) If better, ease off; 4) Repeat each chunk.
- Why it matters: Big corrections when needed; a gentle touch near success.
Anchor: If the mug drifts off target, guidance ramps up; as it aligns, guidance relaxes.
New concept 11: Stage Switching (Schmitt Trigger)
Hook: A light with two click points avoids flicker; you need a clear push to turn it on or off.
The Concept: Use two thresholds (high/low) to decide when to move to the next task stage without wobbling back and forth.
- How it works: 1) If reward > high, advance stage; 2) If between, hold; 3) If < low, reinforce/try again.
- Why it matters: Stable progress across multi-step tasks.
Anchor: Only when the mug is clearly over the plate do we switch from "move" to "place."
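To make these last two building blocks concrete, here is a minimal Python sketch of adaptive guidance strength plus a Schmitt-trigger stage switch. It is illustrative only: the thresholds, scaling factors, and class interface are assumptions, not the paper's exact code.

```python
# Minimal sketch of adaptive guidance strength and Schmitt-trigger stage
# switching. Thresholds and update rules are illustrative assumptions.

class ClosedLoopController:
    def __init__(self, base_scale=1.0, high=0.8, low=0.3):
        self.scale = base_scale      # current guidance strength
        self.high = high             # reward needed to advance a stage
        self.low = low               # reward below which we retry harder
        self.prev_reward = None
        self.stage = 0

    def update_guidance(self, reward):
        """Strengthen guidance when reward drops, relax it when reward improves."""
        if self.prev_reward is not None:
            if reward < self.prev_reward:
                self.scale *= 1.5    # far off course: steer harder
            else:
                self.scale *= 0.8    # converging: let the base policy finesse
        self.prev_reward = reward
        return self.scale

    def update_stage(self, reward):
        """Hysteresis: advance only on a clearly high reward, never flip-flop."""
        if reward > self.high:
            self.stage += 1          # e.g., switch from "move" to "place"
            self.prev_reward = None  # new stage, start fresh
        elif reward < self.low:
            self.scale *= 1.5        # reinforce and retry the current stage
        return self.stage
```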
03 Methodology
At a high level: Observation + Instruction → Ground to keypoints → VLM writes differentiable stage rewards → Denoising with gradient nudges + diversity + resampling → Closed-loop adjustment and stage switching → Final action chunk.
Step-by-step (like a recipe):
- Input and grounding
- What happens: The robot gets an RGB-D image and a sentence (e.g., "Place the mug on the green plate"). A VLM identifies relevant objects; SAM segments them; DINO features plus depth turn pixels into a 3D cloud; clustering picks a few task-relevant keypoints (mug base center, plate center, etc.).
- Why it exists: Raw images are messy. Keypoints capture the geometry we need, compactly and robustly.
- Example data: Keypoints P = {p_mug_base = (x1, y1, z1), p_plate_center = (x2, y2, z2)}.
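As a rough illustration of this grounding step, here is a simplified Python sketch that turns a segmentation mask plus depth into one 3D keypoint. It is an assumption-heavy stand-in: the real pipeline uses SAM masks, DINO features, and clustering, while this sketch just back-projects masked pixels through the camera intrinsics and averages them.

```python
import numpy as np

def backproject_keypoint(mask, depth, K):
    """Average the 3D points inside a segmentation mask into one keypoint.

    mask:  (H, W) boolean array for one object part (e.g., from SAM)
    depth: (H, W) depth image in meters
    K:     (3, 3) camera intrinsics
    The single-centroid reduction is a simplifying assumption standing in
    for the feature clustering described above.
    """
    v, u = np.nonzero(mask)                  # pixel coordinates inside the mask
    z = depth[v, u]
    valid = z > 0                            # drop missing depth readings
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]          # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)     # (N, 3) point cloud for the object
    return points.mean(axis=0)               # one 3D keypoint, e.g. p_plate_center

# keypoints = {"mug_base": backproject_keypoint(mug_mask, depth, K),
#              "plate_center": backproject_keypoint(plate_mask, depth, K)}
```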
- Stage-aware reward generation
- What happens: The VLM decomposes the task into stages (e.g., Stage 1: move above mug; Stage 2: grasp; Stage 3: move to plate; Stage 4: place). For each stage s, it outputs a small PyTorch function R_s that smoothly rewards desired relations (e.g., distance to target point, alignment of gripper, soft collision avoidance).
- Why it exists: We need a number to tell how good a partial trajectory is, and a gradient to know which way to improve.
- Example data: R_move_to_plate(a) = exp(-||gripper_tip(a_T) - p_plate_center||) + 0.2 * alignment_bonus.
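Here is a minimal PyTorch sketch of what such a VLM-written stage reward could look like. The weights, the exp(-distance) shaping, and the simplified (B, T, 3) action layout are illustrative assumptions, not the paper's generated code.

```python
import torch

def reward_move_to_plate(actions, p_plate_center, safe_height=0.10):
    """Differentiable stage reward: higher when the final gripper position in the
    action chunk is near the plate center and stays above a safe height.

    actions:        (B, T, 3) batch of proposed gripper positions (a simplified
                    action space; real chunks also carry orientation/gripper state)
    p_plate_center: (3,) keypoint from the grounding step
    """
    tip = actions[:, -1, :]                                  # last waypoint of each proposal
    dist = torch.linalg.norm(tip - p_plate_center, dim=-1)   # (B,) distance to target
    approach = torch.exp(-dist)                              # smooth, bounded in (0, 1]
    height_ok = torch.sigmoid(50.0 * (tip[:, 2] - p_plate_center[2] - safe_height))
    return approach + 0.2 * height_ok                        # (B,) per-proposal reward
```

Because every operation is a smooth torch function, gradients with respect to the actions are available for the steering step described below.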
- Diverse proposal initialization with repulsive forces
- What happens: Sample B random action proposals from the diffusion/flow noise. Add a gentle repulsion term so candidates don't pile up early.
- Why it exists: Early diversity explores alternatives (center vs. edge approaches), which helps in clutter or weird layouts.
- Example: With B=16, proposals spread around the plate instead of all aiming dead center.
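A minimal sketch of one way to implement such a repulsion term, using an SVGD-style RBF kernel between flattened proposals; the kernel choice and bandwidth are assumptions rather than the paper's formulation.

```python
import torch

def repulsion_term(proposals, bandwidth=1.0):
    """Gentle 'spread out' force: each proposal is pushed away from its
    near-duplicates, weighted by an RBF kernel over pairwise distance.

    proposals: (B, T, D) noisy action proposals during early denoising steps
    returns:   (B, T, D) per-proposal push directions
    """
    B = proposals.shape[0]
    flat = proposals.reshape(B, -1)                        # (B, T*D)
    diff = flat.unsqueeze(1) - flat.unsqueeze(0)           # (B, B, T*D) pairwise offsets
    sq_dist = (diff ** 2).sum(-1)                          # (B, B)
    kernel = torch.exp(-sq_dist / (2 * bandwidth ** 2))    # close pairs get large weight
    push = (kernel.unsqueeze(-1) * diff).sum(dim=1) / B    # move away from similar proposals
    return push.reshape_as(proposals)

# Early in denoising one might apply:
# proposals = proposals + 0.1 * repulsion_term(proposals)
```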
- Gradient-based refinement during denoising
- What happens: For each denoising (or flow) step, compute ∇_a R_s and add it to the model's predicted update (noise or velocity). Do a few inner updates (like tiny MCMC steps) to smooth out noise.
- Why it exists: These are the tiny "turn-left-a-bit" nudges that align the trajectory with the current scene and goal.
- Example: If distance to plate center is too big, the gradient points to reduce it; the next step shifts that way.
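Here is a hedged sketch of one reward-guided denoising step. `policy.denoise_step` is a hypothetical interface for the frozen diffusion/flow policy, and the inner-loop count and step size are illustrative assumptions.

```python
import torch

def guided_denoise_step(policy, actions, t, reward_fn, scale, n_inner=2):
    """One reward-guided step: the frozen policy proposes its usual update and
    the reward gradient nudges the proposals toward higher reward."""
    for _ in range(n_inner):                       # a few small corrective nudges
        actions = actions.detach().requires_grad_(True)
        reward = reward_fn(actions).sum()          # differentiable stage reward R_s
        grad = torch.autograd.grad(reward, actions)[0]
        actions = actions + scale * grad           # move toward higher reward
    with torch.no_grad():
        actions = policy.denoise_step(actions, t)  # the frozen policy's own update
    return actions.detach()
```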
- Gradient-free resampling via Feynman-Kac
- What happens: Weight each candidate by exp(R_s), resample so high-reward ones are more likely to continue.
- Why it exists: This accelerates convergence and drops poor options without getting stuck on a single idea too soon.
- Example: Two best candidates that skirt clutter cleanly get duplicated; others that aim straight through an obstacle get removed.
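A minimal sketch of Feynman-Kac style resampling over the proposal batch; the softmax weighting and temperature are illustrative choices rather than the paper's exact scheme.

```python
import torch

def feynman_kac_resample(actions, reward_fn, temperature=1.0):
    """Duplicate high-reward proposals and drop low-reward ones,
    keeping the batch size fixed."""
    with torch.no_grad():
        rewards = reward_fn(actions)                           # (B,) per-proposal scores
        weights = torch.softmax(rewards / temperature, dim=0)  # weights proportional to exp(R/temp)
        idx = torch.multinomial(weights, num_samples=actions.shape[0],
                                replacement=True)              # sample ancestors
    return actions[idx].clone()
```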
- Closed-loop control: adaptive guidance and stage switching
- What happens: After an action chunk is formed, compute its reward. If it's lower than earlier chunks, increase guidance next time; if higher, relax guidance to let the base policy finesse the motion. Check the Schmitt-trigger thresholds to decide if it's time to switch stages or retry.
- Why it exists: Keeps control stable and efficient across long tasks under real-world uncertainty (slips, bumps, new object positions).
- Example: Once the mug is right over the plate (reward high), flip to the "place" stage and reduce guidance for gentle lowering.
Secret sauce (what makes this clever):
- Programmatic, differentiable rewards written by a VLM turn open-world understanding into gradients the robot can use instantly.
- Hybrid steering: precise gradient nudges + global resampling + enforced diversity.
- Closed-loop stage logic with hysteresis avoids oscillations and micromanaging.
- Training-free: The base policy stays frozen; all adaptation is during inference.
Concrete walkthrough example:
- Scene: Two plates (red, green) have swapped positions; instruction: "Place the mug on the green plate."
- Grounding: VLM finds mug, both plates; keypoints include centers; the green plate is now on the left.
- Rewards: Stage 1 (approach mug): reward increases as gripper tip nears mug base center and aligns with grasp axis. Stage 3 (transfer): reward increases as the carried mug nears green-plate center while keeping a safe height.
- Denoising: Start 16 proposals. Repulsion spreads them. Gradients pull toward the green plate's new spot. FK resampling multiplies clean approaches. Adaptive guidance eases near success. The Schmitt trigger flips to "place" only when clearly above the green plate.
- Output: A smooth, OOD-aware action chunk that succeeds without changing any policy weights.
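To show how the pieces fit together, here is a compact control-cycle sketch that reuses the helper sketches above (ClosedLoopController, repulsion_term, guided_denoise_step, feynman_kac_resample). `vlm_write_stage_rewards`, `policy.sample_noise`, and `robot.execute_chunk` are hypothetical placeholders, and the schedule (when to repel, when to resample) is an assumption.

```python
def vls_control_cycle(policy, robot, vlm_write_stage_rewards, keypoints,
                      ctrl=None, B=16, n_steps=20, repulsion_scale=0.1):
    """One steering loop over the task stages; illustrative, not the paper's code."""
    ctrl = ctrl or ClosedLoopController()
    stage_rewards = vlm_write_stage_rewards(keypoints)       # list of R_s functions
    while ctrl.stage < len(stage_rewards):
        reward_fn = stage_rewards[ctrl.stage]
        actions = policy.sample_noise(B)                     # (B, T, D) noisy proposals
        for t in reversed(range(n_steps)):
            if t > n_steps // 2:                             # keep diversity early on
                actions = actions + repulsion_scale * repulsion_term(actions)
            actions = guided_denoise_step(policy, actions, t, reward_fn, ctrl.scale)
            if t % 5 == 0:                                   # periodic FK resampling
                actions = feynman_kac_resample(actions, reward_fn)
        best = actions[reward_fn(actions).argmax()]          # pick the top proposal
        robot.execute_chunk(best)                            # hypothetical robot API
        r = float(reward_fn(best.unsqueeze(0)))              # score the executed chunk
        ctrl.update_guidance(r)                              # steer harder or relax
        ctrl.update_stage(r)                                 # Schmitt-trigger switch
```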
04 Experiments & Results
The test: Do robots succeed more often under OOD changes when we steer them with VLS? We measure success rate (did the task finish), episode length (how many steps it took), and inference time (latency). We test in simulation on CALVIN and LIBERO-PRO, and on a real Franka robot.
The competition (baselines):
- VLA models: OpenVLA, π, π-0.5, and a finetuned π-0.5 (LeRobot).
- Steering methods: DynaGuide (feature-distance guidance), ITPS (human/VLM-in-the-loop policy steering).
The scoreboard (with context):
- CALVIN: VLS lifts success strongly. On movable objects (colored cubes), VLS hits about 94% average, like acing a test, versus the base policy's much lower rate (about a 7.4× improvement). On articulated parts (door, drawer, button, switch), VLS averages about 87%, roughly a 9.6× boost over the base policy. Compared to DynaGuide and ITPS, VLS is ahead by roughly 15-25 percentage points, especially when object positions vary.
- LIBERO-PRO: VLS improves frozen VLA policies under both position and task perturbations. Overall, VLS gives up to a 13% absolute gain compared to strong baselines across suites (Goal, Spatial, 10-Long, Object). Think of that like moving from a solid B to a strong A- when others slip to C in harder versions of the exam.
- Real robot (Franka): In in-distribution setups, VLS improves average success by about 19% over the frozen baseline. Under OOD changes (new plate color, swapped plate positions, or replacing a banana with an unseen mug), the baseline often collapses, while VLS keeps going, achieving, for example, 40% success even on the hardest unseen-object swap where the baseline fails.
Surprising findings:
- Dense gradients are the star: Removing gradient guidance collapses success and lengthens episodes, showing that continuous, differentiable feedback is essential.
- Diversity and FK help efficiency: Without repulsive forces or FK resampling, success drops a bit, but efficiency and stability suffer notably. The combo prevents early mode collapse and speeds convergence.
- Compute-performance tradeoff: Larger sample batches raise success and shorten episodes but increase inference time. This makes VLS tunable: more compute for tougher scenes, less for easier ones.
Why these results matter: They show that training-free, inference-time control can rival or beat methods that depend on learned critics, dynamics rollouts, or heavy search, and it works in the messy real world, not just in clean simulators.
Ablation highlights (like taking parts out of a bike to see what breaks):
- w/o Gradient Guidance: Big performance drop; episodes get long. This is the bike's chain: remove it and you don't move.
- w/o FK Resampling: Success modestly down; runtime and stability worse. This is like losing your gear shifter: it still rides, but less smoothly.
- w/o Repulsive Forces: Slight success drop; more time to converge. This is losing your bell: you still ride, but you miss helpful signals early.
Scaling behavior:
- Increasing batch size B improves success and reduces steps but raises latency. Pick B based on how fast you must react in your application.
Overall takeaway: VLS brings a strong, practical gain on both benchmarks and a real robot, with understandable tradeoffs you can dial for your needs.
05 Discussion & Limitations
Limitations (be specific):
- Latency: Keeping a batch of candidates, doing inner refinements, and resampling adds milliseconds to hundreds of milliseconds, depending on batch size and steps. Real-time constraints may cap how big you can go.
- VLM dependence: If the VLM misidentifies objects or writes awkward rewards, guidance can point the wrong way. Guardrails and prompt engineering help, but errors can sneak in.
- Differentiable reward design: Rewards must be smooth and informative. Some goals (like exact contact physics) are hard to capture with simple, differentiable surrogates.
- Very large OOD gaps: If the new task is far beyond the base policy's skill set (e.g., tool use never seen), steering won't invent missing skills.
Required resources:
- A frozen diffusion or flow-matching robot policy.
- A VLM, segmentation (e.g., SAM), and feature extractor (e.g., DINOv2) with RGB-D input.
- GPU/accelerator for fast denoising with batches and gradients; CPU is possible but slower.
When not to use VLS:
- Ultra-low-latency control (tight reflex loops) where any extra milliseconds are unacceptable.
- Tasks demanding fine-grained non-differentiable objectives (e.g., brittle contact dynamics) unless you can approximate them smoothly.
- Situations where the base policy lacks the core motor primitives; steering can't add skills it never learned.
Open questions:
- Can we auto-check and refine the VLM's rewards online (e.g., progress-aware prompts, self-consistency checks)?
- Can we learn a tiny, general reward prior that speeds up VLM reward writing while staying training-free for the policy?
- How to combine tactile/force feedback into differentiable rewards for better contact handling?
- Can we compress compute via distillation or caching so small robots can run VLS on-board?
- How does VLS scale to very long-horizon, multi-object tasks with many stage switches, and can we plan stage order automatically?
06 Conclusion & Future Work
Three-sentence summary:
- VLS adapts frozen diffusion/flow robot policies at inference time by using a VLM to turn open-world scenes and instructions into differentiable, stage-wise rewards.
- These rewards provide gradients that steer denoising, while diversity and resampling keep exploration healthy, and closed-loop logic stabilizes long tasks.
- Across challenging OOD tests in sim and on a real Franka robot, VLS boosts success significantly without any retraining.
Main achievement:
- Showing that programmatic, VLM-authored, differentiable rewards can reliably steer pretrained robot policies at test time, closing much of the OOD gap that retraining used to handle.
- Cut latency via better sampling schedules, caching, or distilled guidance; add progress-aware reward generation; fold in richer sensing (e.g., force) for contact phases.
Why remember this:
- It flips the script from "retrain for every change" to "steer what you already know," using VLM-written rewards as a smart bridge between open-world understanding and low-level action generation. This makes robust, practical robot deployment far more attainable in the messy, changing real world.
Practical Applications
- Home assistants that can still tidy up when furniture is moved or new items appear.
- Factory pick-and-place robots that adapt to shifting bins and mixed parts without reprogramming.
- Warehouse robots that reroute grasps and placements when shelves are reorganized.
- Hospital delivery robots that handle new carts, trays, or instructions on the fly.
- Kitchen robots that place, pour, or fetch items despite clutter and changing counter layouts.
- Education and research platforms that test many tasks quickly without finetuning policies each time.
- Field robots that cope with weathered, dirty, or novel objects through VLM-informed rewards.
- Rapid prototyping: engineers swap goals in language and see immediate behavior changes.
- Assistive devices that adjust to each user's environment and preferences via simple instructions.
- Quality control arms that adapt inspection paths to new product variants using keyed keypoints.