
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate
Shijie Lian, Bin Yu, Xiaopeng Lin et al. Ā· 1/21/2026
arXiv Ā· PDF

Key Summary

  • Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
  • LangForce fixes this by splitting the robot’s brain into two parts: a vision-only guess (prior) and a vision-plus-words answer (posterior).
  • It adds special Latent Action Queries—tiny learnable tokens—that ask the model, ā€œWhat action details really depend on the instruction?ā€
  • Training maximizes a log-likelihood ratio (LLR), which rewards actions that make the instruction more explainable than vision alone.
  • This is the same as boosting pointwise mutual information (PMI) between actions and instructions, so the robot must truly listen.
  • Without new data, LangForce improves out-of-distribution generalization on SimplerEnv by 11.3% over a strong baseline (66.5% vs. 55.2%).
  • It also shines on ambiguous tasks in LIBERO Goal (99.4% vs. 97.4% baseline) and beats others on RoboCasa (52.6% average vs. 47.8%).
  • The method preserves the model’s text-only reasoning better, avoiding the gibberish failures seen in standard fine-tuned VLA models.
  • Training has two branches (a small overhead), but inference uses only one, so there’s no extra runtime cost at test time.

Why This Research Matters

Robots that truly listen to instructions are safer and more helpful in homes, hospitals, and factories. LangForce turns ā€œfollow the wordsā€ into a clear training rule, so models can’t coast on visual habits. This makes them handle ambiguous scenes better, like choosing between a drawer and a cabinet only when asked. It also helps them adapt to new places that don’t look like the training videos, which is essential for real-world deployment. By preserving text-only reasoning, it keeps the robot’s brain smarter and more conversational for everyday guidance. And because it needs no new data, teams can retrofit existing models to be more reliable without re-collecting demos.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you tell a robot, ā€œPut the carrot on the plate,ā€ but it keeps putting the spoon on the towel because that’s what the kitchen usually looks like in its training videos. It’s guessing from the scene instead of listening to you.

🄬 The Concept (Vision-Language-Action Models):

  • What it is: A Vision-Language-Action (VLA) model takes in pictures (vision) and words (language) to choose movements (actions).
  • How it works:
    1. See the scene with a camera (vision tokens).
    2. Read the instruction (ā€œput carrot on plateā€).
    3. Fuse vision and language to predict the next robot arm motion.
  • Why it matters: Without understanding words, the robot can’t do what you ask when the same room could mean different tasks. šŸž Anchor: If a room has a carrot, a plate, a towel, and a basket, only the language tells the robot which of many valid moves you actually want.

šŸž Hook: You know how a friend might always order pizza at a certain restaurant without reading the menu? That’s a shortcut based on the place, not the request.

🄬 The Concept (Vision Shortcut / Information Collapse):

  • What it is: In many robot datasets, the scene almost fully predicts the task, so the model learns to ignore the instruction text.
  • How it works:
    1. Datasets are goal-driven: similar scenes are paired with the same task labels again and again.
    2. The model can predict the task words from the picture alone.
    3. The extra benefit of language vanishes (conditional mutual information collapses), so the model behaves like vision-only.
  • Why it matters: When the scene changes (new kitchen) or multiple tasks are possible in the same scene, the robot fails because it never learned to use your words. šŸž Anchor: On LIBERO Goal, the same counter could mean ā€œput bowl on stoveā€ or ā€œput bowl in drawer.ā€ If the robot ignores language, it messes up.

The World Before: VLA models looked great in familiar environments. They were trained on big vision-language datasets and then fine-tuned on robot demos. But because demos were collected in a goal-driven way (people repeat the same task in the same setting), language became predictable from vision. So models quietly learned a vision-only policy p(a|v).

The Problem: When instructions are truly needed to choose between several valid actions in the same scene, these models stumble—especially out-of-distribution (OOD), where the visuals differ.

Failed Attempts:

  • Just adding more demos often repeats the same biases.
  • Freezing large parts of the model preserves language skills but doesn’t fix the shortcut.
  • Training bigger models can memorize even faster without learning the real language-to-action link.

The Gap: We needed a way to force the policy to depend on language—even when the data alone nudges it to depend on vision.

Real Stakes:

  • Home robots might open the cabinet when you asked for the drawer.
  • Warehouse bots could misplace items when shelves look similar.
  • In new environments, a robot that guessed from looks will fail; a robot that listens will adapt.

02 Core Idea

šŸž Hook: Imagine a teacher grading two answers. One student copies from the picture; the other explains using both the picture and the instructions. The teacher gives extra credit if the words really matter.

🄬 The Concept (Key Insight):

  • What it is: Split the policy into a vision-only prior and a vision+language posterior, then reward the model when actions make the instruction more likely than vision alone (maximize LLR/PMI).
  • How it works:
    1. Build a vision-only path p(a|v).
    2. Build a vision+language path π(a|v,ℓ).
    3. Use Latent Action Queries as a compact interface from the VLM to the action head.
    4. Maximize log Ļ€(a|v,ā„“) āˆ’ log p(a|v), which equals log p(ā„“|a,v) āˆ’ log p(ā„“|v) (a one-line derivation appears after this block).
  • Why it matters: This punishes solutions that ignore language and rewards actions that truly explain the instruction. šŸž Anchor: If the instruction is ā€œPut carrot on plate,ā€ the model gets points only when its chosen movement makes that exact instruction more explainable than vision alone.
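
For readers who want the algebra, the equality in step 4 is just Bayes’ rule conditioned on the visual context; here is the one-line derivation, using the same symbols as above (it restates the idea rather than adding new machinery):

```latex
% Assume the posterior \pi(a \mid v, \ell) is the exact conditional p(a \mid v, \ell).
% Bayes' rule conditioned on v:
\pi(a \mid v, \ell) \;=\; \frac{p(\ell \mid a, v)\, p(a \mid v)}{p(\ell \mid v)}
\quad\Longrightarrow\quad
\log \pi(a \mid v, \ell) - \log p(a \mid v)
\;=\; \log p(\ell \mid a, v) - \log p(\ell \mid v)
\;=\; \mathrm{PMI}(a;\, \ell \mid v).
```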

Three Analogies:

  1. Detective vs. guesser: The guesser says, ā€œLooks like a robberyā€ from first glance; the detective uses both the scene and a witness statement. LangForce rewards the detective.
  2. Recipe vs. fridge peek: Glancing in a fridge might suggest pasta, but the recipe card says salad. LangForce checks if your cooking steps actually fit the recipe, not just the fridge view.
  3. Map vs. road signs: You might think ā€œturn rightā€ from a familiar road shape, but the sign (language) says ā€œdetour left.ā€ LangForce favors actions that match the sign.

šŸž Hook: You know how you can take apart a LEGO build to see which pieces do what?

🄬 The Concept (Bayesian Decomposition):

  • What it is: A way to split the policy into a vision-only prior and a language-conditioned posterior using probability rules.
  • How it works:
    1. Prior p(a|v): what actions are likely just from the scene.
    2. Likelihood p(ā„“|a,v): how well a chosen action explains the instruction.
    3. Posterior π(a|v,ℓ): combine them so actions fit both the scene and the words.
  • Why it matters: If the prior dominates, models ignore language; the decomposition lets us measure and fix that. šŸž Anchor: In a kitchen, p(a|v) may prefer ā€œopen cabinet,ā€ but given ā„“ = ā€œopen drawer,ā€ the posterior must shift toward the drawer action.

šŸž Hook: Picture tiny helpers whispering, ā€œWhich exact move matches the words?ā€

🄬 The Concept (Latent Action Queries):

  • What it is: Learnable tokens added to the VLM that bottleneck and carry only the action-relevant summary to the action head.
  • How it works:
    1. Insert K small query tokens Q into the VLM’s input.
    2. Position them so they can see vision only (prior) or vision+language (posterior).
    3. Send only their hidden states to the action module.
  • Why it matters: This cleanly separates vision-only and vision+language information and makes action prediction depend on what the tokens captured. šŸž Anchor: Think of Q as a small notepad where the model writes the few details needed to actually move the robot arm; the notepad contents change if it reads the instruction. (A minimal code sketch of this bottleneck follows below.)
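
As a rough picture of that bottleneck, here is a minimal PyTorch-style sketch: K learnable query embeddings are appended to the token sequence, and only their hidden states are handed to the action head. The class name, the K=64 default, and the backbone interface are illustrative assumptions, not the paper’s released code.

```python
import torch
import torch.nn as nn

class LatentActionQueries(nn.Module):
    """Sketch: K learnable query tokens; only their hidden states reach the action head."""

    def __init__(self, num_queries: int = 64, hidden_dim: int = 2048):
        super().__init__()
        # K small learnable embeddings, shared across all episodes (the "notepad").
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def append_to(self, token_embeds: torch.Tensor) -> torch.Tensor:
        """Concatenate the queries after an existing (B, seq_len, D) token sequence."""
        batch = token_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeds, q], dim=1)

# After the VLM forward pass, keep only the last K hidden states (the queries'):
#   hidden = vlm(inputs_embeds=seq).last_hidden_state   # (B, seq_len + K, D)
#   h_q = hidden[:, -num_queries:, :]                    # all the action head sees
```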

šŸž Hook: You know how two facts can be strongly connected, like thunder and lightning?

🄬 The Concept (Pointwise Mutual Information, PMI):

  • What it is: A score of how much one thing (action) tells you about another (instruction), given the scene.
  • How it works:
    1. Compare ā€œinstruction given action+visionā€ to ā€œinstruction given vision.ā€
    2. If the action really follows the instruction, the first is much higher.
    3. The difference is PMI.
  • Why it matters: High PMI means the action encodes the instruction instead of guessing from the picture. šŸž Anchor: If you move toward the plate with the carrot, that action strongly points to the instruction ā€œput carrot on plate.ā€ (A tiny worked example follows below.)
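
A tiny worked example of that comparison, with made-up per-token log-probabilities (the numbers are invented purely to illustrate the arithmetic):

```python
import math

# Hypothetical per-token log-probs of the instruction "put carrot on plate"
# when the model also knows the chosen action (action + vision)...
logp_tokens_given_action_and_vision = [-0.4, -0.7, -0.3, -0.5]
# ...versus when it only sees the scene (vision alone).
logp_tokens_given_vision = [-1.6, -2.1, -1.2, -1.8]

logp_l_given_av = sum(logp_tokens_given_action_and_vision)  # log p(l | a, v)
logp_l_given_v = sum(logp_tokens_given_vision)              # log p(l | v)

pmi = logp_l_given_av - logp_l_given_v                      # PMI(a; l | v)
print(f"PMI = {pmi:.2f} nats")                  # positive: the action explains the words
print(f"likelihood ratio = {math.exp(pmi):.0f}x")
```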

Before vs. After:

  • Before: Policies often collapsed into p(a|v), succeeding only when scenes matched training quirks.
  • After: Policies maximize PMI/LLR, so they must reflect what was said, not just what was seen—better OOD and ambiguous-task performance.

Why It Works (Intuition):

  • The model gets credit only if its chosen actions make the exact instruction more predictable than vision alone. That forces it to extract and use the semantic bits in the words that vision can’t supply.

Building Blocks:

  • Vision-only prior and language-conditioned posterior (the concepts introduced above), Latent Action Queries, PMI/LLR maximization, and a dual-branch setup that trains both paths together but uses only the posterior at test time.

03 Methodology

High-level Overview: Input (images + instruction) → VLM encodes tokens → Insert Latent Action Queries Q → Two branches (prior: vision-only; posterior: vision+language) → Action head (Diffusion Transformer) predicts continuous actions → Train with flow matching + LLR objective → At test time, use only posterior branch.

šŸž Hook: Think of a two-lane training road—one lane sees only the scenery, the other sees scenery plus road signs (instructions).

🄬 The Concept (Dual-Branch Architecture):

  • What it is: A training setup with two parallel passes through the same VLM weights but different token orders.
  • How it works:
    1. Prior branch input: [vision, Q, language]. Because of causal masking, Q can’t see language—so Q summarizes vision-only.
    2. Posterior branch input: [vision, language, Q]. Now Q can see both—so Q summarizes vision+language.
    3. Only Q’s hidden states go to the action model; this is the bottleneck.
  • Why it matters: It lets us estimate p(a|v) and Ļ€(a|v,ā„“) cleanly, and compute a meaningful LLR signal. šŸž Anchor: It’s like running the same movie twice: once with subtitles hidden (prior) and once with subtitles shown (posterior), then comparing what you learned. (A code sketch of the two passes follows below.)
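
Below is a minimal sketch of the two passes around a HuggingFace-style causal VLM (assumed to accept inputs_embeds and return last_hidden_state). The function name, tensor shapes, and slicing are illustrative assumptions; the key point is that the same weights are run twice with different token orders.

```python
import torch

def dual_branch_query_states(vlm, vision_embeds, lang_embeds, query_embeds):
    """Run the same causal VLM twice with different token orders.

    vision_embeds: (B, Nv, D)  lang_embeds: (B, Nl, D)  query_embeds: (B, K, D)
    Returns (h_q_prior, h_q_post), each (B, K, D).
    """
    nv, k = vision_embeds.shape[1], query_embeds.shape[1]

    # Prior branch: [vision, Q, language]. Under a causal mask the Q positions
    # precede the language tokens, so they can attend to vision only.
    prior_seq = torch.cat([vision_embeds, query_embeds, lang_embeds], dim=1)
    h_q_prior = vlm(inputs_embeds=prior_seq).last_hidden_state[:, nv:nv + k, :]

    # Posterior branch: [vision, language, Q]. The Q positions come last,
    # so causal attention lets them read both vision and language.
    post_seq = torch.cat([vision_embeds, lang_embeds, query_embeds], dim=1)
    h_q_post = vlm(inputs_embeds=post_seq).last_hidden_state[:, -k:, :]

    return h_q_prior, h_q_post
```

Because attention is causal, simply placing Q before the language tokens hides the instruction from the prior branch; no second set of weights is needed.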

Step-by-step (like a recipe):

  1. Tokenize vision frames and the instruction into the VLM.
  2. Insert K=64 Latent Action Query tokens Q.
  3. Prior pass: order tokens so Q attends to vision only; send HQ_prior to the action head; train with a flow-matching loss to learn typical actions in this scene.
  4. Posterior pass: order tokens so Q attends to both vision and language; send HQ_post to the action head; train with a main flow-matching loss to match expert actions.
  5. Compute LLR: encourage log p(ā„“ | v, HQ_prior) to be higher than a no-action baseline log p(ā„“ | v) (with stop-gradient on the baseline). This pushes Q to carry instruction-explaining info.
  6. Total loss: combine the posterior flow loss, the prior flow loss (weighted by λ), and the LLR term (weighted by β); a code sketch of this combination appears after this recipe.
  7. Inference: run only the posterior pass (no extra test-time cost) and generate continuous actions from the Diffusion Transformer.
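
As a rough sketch of how steps 3 to 6 could combine into one objective, following the description above (the helper name, the λ and β defaults, and the exact stop-gradient placement are illustrative assumptions, not the paper’s code):

```python
import torch

def langforce_loss(flow_loss_post, flow_loss_prior,
                   logp_instr_given_vq, logp_instr_given_v,
                   lam: float = 0.1, beta: float = 0.1):
    """Combine the three training terms from the recipe above.

    flow_loss_post:      flow-matching loss of the posterior branch (main term)
    flow_loss_prior:     flow-matching loss of the vision-only prior branch
    logp_instr_given_vq: log p(l | v, HQ_prior), instruction likelihood given
                         vision plus the latent action queries
    logp_instr_given_v:  log p(l | v), the no-action baseline
    """
    # LLR / PMI surrogate: push the query-conditioned likelihood above the
    # vision-only baseline, treating the baseline as a constant (stop-gradient).
    llr = logp_instr_given_vq - logp_instr_given_v.detach()

    # Minimize the flow losses, maximize the LLR (hence the minus sign).
    return flow_loss_post + lam * flow_loss_prior - beta * llr.mean()
```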

šŸž Hook: You know how a sketch can be turned into a smooth drawing by learning the direction to move your pencil?

🄬 The Concept (Rectified Flow Matching):

  • What it is: A training method where a model learns a ā€œvelocity fieldā€ that moves noisy actions toward real expert actions.
  • How it works:
    1. Mix a ground-truth action with random noise at time t.
    2. Predict the velocity to go from the noisy point toward the clean action.
    3. Repeat across times so the model learns to denoise into the right action.
  • Why it matters: It generates stable, precise continuous actions, which is important for smooth robot control. šŸž Anchor: It’s like guiding a pen from a blurry scribble to a clean trace, step by step. (A training-step sketch follows below.)
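
Here is a compact sketch of one rectified flow-matching training step under a common convention (straight-line interpolation from noise to the expert action, so the target velocity is simply their difference); the action-head call signature is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(action_head, expert_actions, condition):
    """One rectified flow-matching step.

    expert_actions: (B, horizon, action_dim) ground-truth action chunk
    condition:      (B, K, D) query hidden states the head is conditioned on
    """
    noise = torch.randn_like(expert_actions)            # starting point x_0
    t = torch.rand(expert_actions.shape[0], 1, 1,       # random time in [0, 1]
                   device=expert_actions.device)

    # Point on the straight line between noise and the clean action.
    x_t = (1.0 - t) * noise + t * expert_actions

    # For a straight-line (rectified) path, the target velocity is constant.
    target_velocity = expert_actions - noise

    pred_velocity = action_head(x_t, t.reshape(-1), condition)
    return F.mse_loss(pred_velocity, target_velocity)
```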

šŸž Hook: Picture a translator that only accepts a tiny summary card from the VLM and still has to speak perfectly.

🄬 The Concept (Diffusion Transformer Action Head):

  • What it is: The action generator that reads just HQ (the queries’ hidden states) and outputs robot actions.
  • How it works:
    1. Conditions on HQ_post (or HQ_prior during training).
    2. Predicts a step-by-step velocity to denoise actions.
    3. Produces a continuous trajectory the robot can execute.
  • Why it matters: By only reading HQ, it forces the VLM to compress what truly matters for control, keeping the architecture efficient. šŸž Anchor: Instead of reading the whole book (all tokens), the action head reads a postcard (HQ) with only the essential notes. (A test-time sampling sketch follows below.)
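
And a matching sketch of test time: start from noise and integrate the learned velocity field for a few Euler steps, conditioned only on the posterior query states HQ_post (the step count, horizon, and interfaces are illustrative assumptions):

```python
import torch

@torch.no_grad()
def generate_actions(action_head, h_q_post, horizon=16, action_dim=7, steps=10):
    """Denoise random noise into an action chunk, conditioned only on HQ_post."""
    batch = h_q_post.shape[0]
    x = torch.randn(batch, horizon, action_dim, device=h_q_post.device)

    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt, device=h_q_post.device)
        velocity = action_head(x, t, h_q_post)  # predicted direction toward clean actions
        x = x + dt * velocity                   # one Euler step along the flow

    return x  # continuous action trajectory for the robot to execute
```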

Concrete Example (with actual tasks):

  • Instruction: ā€œPut carrot on plate.ā€ Scene: plate, towel, carrot, eggplant.
  • Prior branch learns common moves in this scene (maybe it often saw ā€œput spoon on towelā€ in similar kitchens).
  • Posterior branch must pick the action that fits the words, steering toward the plate with the carrot.
  • LLR gives credit only if the chosen action makes the instruction more predictable than vision alone, so the model can’t rely on the spoon-towel habit.

Secret Sauce:

  • The bottleneck (Latent Action Queries) plus the token ordering trick creates a clean separation between vision-only and vision+language contexts.
  • The LLR term turns that separation into a learning signal that directly punishes ignoring language.

04 Experiments & Results

The Test: The team evaluated how well policies follow instructions when scenes are tricky or new—exactly where vision-only guessing fails. They measured success rate: how often the robot completes the task.

The Competition: LangForce was compared to strong VLA baselines like QwenGR00T, OpenVLA-OFT, π-series (flow/FAST), Isaac-GR00T variants, CogACT, SpatialVLA, VideoVLA, and others.

Scoreboard with context:

  • SimplerEnv (OOD simulation using policies trained on BridgeDataV2 + Fractal):
    ◦ Baseline QwenGR00T: 55.2% average success.
    ◦ LangForce: 66.5% average, an 11.3% absolute gain. That’s like raising your grade from a solid B to an A, without new study material.
    ā—¦ Biggest boosts on tasks where language matters: ā€œPut Carrot on Plateā€ (+13.6%) and ā€œPut Eggplant in Yellow Basketā€ (+15.0%).
  • LIBERO (Spatial, Object, Goal, Long):
    ◦ Most suites are already near-saturated by baselines (>95%).
    ◦ Goal (ambiguous scenes): LangForce 99.4% vs. QwenGR00T 97.4%. In a test built to force instruction use, LangForce leads.
    ◦ Conditional entropy proxy (instruction NLL given vision): higher for LangForce (9.47 nats/token vs. 8.51), meaning it preserves the rightful uncertainty when only vision is known, exactly the opposite of the shortcut.
  • RoboCasa (24 tabletop manipulation tasks):
    ◦ Vision-only baseline surprisingly strong at 44.7%, showing shortcut prevalence.
    ◦ QwenGR00T: 47.8%.
    ā—¦ LangForce: 52.6%, the top result across methods. On tasks like ā€œPnP Novel From Placemat To Bowl,ā€ LangForce 62.0% vs. Vision-only 32.0% and QwenGR00T 44.0%.

Surprising Findings:

  • Training loss can look similar for vision-only vs. full VLA on in-the-wild data, yet vision-only fails almost completely OOD. This shows how deceptive shortcuts can be.
  • LangForce not only improves control generalization but also better preserves the model’s text-only reasoning, avoiding the ā€œgibberish driftā€ seen in standard fine-tuning.

Meaning: The LLR/PMI objective effectively breaks the vision shortcut and makes the policy truly instruction-grounded, which pays off most on ambiguous and OOD tests.

05 Discussion & Limitations

Limitations:

  • Training-time overhead: two branches are computed each step. In practice, reusing the shared vision prefix keeps the extra cost small, but it’s still more than single-branch training.
  • Some vision-language abilities (image+text reasoning) may degrade as the visual tower adapts for precise control, even though text-only skills are better preserved.

Required Resources:

  • A modern VLM backbone (e.g., Qwen3-VL-4B-level), GPU budget (e.g., 8ƗH100 in the paper), and datasets like BridgeDataV2/Fractal or task-specific demos.
  • Infrastructure for diffusion/flow-matching training and careful token ordering/masking.

When NOT to Use:

  • If tasks are fully determined by vision and there’s no real need for language (no ambiguity), the extra LLR machinery might bring limited gains.
  • If compute is extremely constrained at training time, the dual-branch setup may be too heavy.

Open Questions:

  • Can the same Bayesian idea be implemented via world models (imagining futures) to further resist shortcuts?
  • How large should the query set be across robots and tasks? Are adaptive or dynamic queries better than a fixed K?
  • Can we design data collection that naturally raises H(ā„“|v), reducing the need for algorithmic fixes?
  • How to retain full multimodal generality (image+text reasoning) while specializing for control?
  • What’s the best way to transfer this idea to audio-language-action or other embodied modalities?

06 Conclusion & Future Work

3-Sentence Summary: LangForce tackles the vision shortcut, where robots ignore instructions because datasets make language too predictable from vision. It splits learning into a vision-only prior and a language-aware posterior, then maximizes a log-likelihood ratio so actions must truly explain the words. With Latent Action Queries and a dual-branch design, it boosts OOD success (e.g., +11.3% on SimplerEnv) without extra test-time cost.

Main Achievement: A practical, end-to-end Bayesian decomposition with an LLR/PMI objective that measurably forces instruction grounding and breaks the shortcut.

Future Directions:

  • Scale to larger VLMs and more diverse benchmarks (e.g., real robots, RoboTwin 2.0, BEHAVIOR-1K).
  • Explore world-model-based decompositions that may further resist information collapse.
  • Refine query mechanisms and data collection to raise H(ā„“|v) naturally.

Why Remember This: It turns ā€œlisten to the wordsā€ into a measurable training rule. By rewarding actions that make the instruction more likely than vision alone, LangForce moves robots closer to truly following human intent in the messy real world.

Practical Applications

  • Retrofit existing VLA robots with LangForce training to boost instruction following on current datasets.
  • Deploy in homes where identical-looking kitchens demand precise language grounding (drawer vs. cabinet).
  • Use in warehouses to place items correctly when shelf layouts are ambiguous or new.
  • Assistive robots in hospitals to follow nurse instructions exactly (e.g., which tray, which room).
  • Quality-check training pipelines: monitor LLR/PMI to detect when models start ignoring language.
  • Dataset design: deliberately include ambiguous scenes to raise H(ā„“|v) and reduce shortcuts.
  • Teach multi-step tasks by ensuring each sub-action increases p(ā„“|a,v) over p(ā„“|v).
  • Port the idea to audio-language-action (e.g., voice commands in noisy environments).
  • Evaluate OOD readiness by comparing success with and without language to expose shortcuts.
  • Preserve conversational ability in robotics assistants by enforcing language dependence during control training.
#Vision-Language-Action#Bayesian decomposition#Latent Action Queries#Log-Likelihood Ratio#Pointwise Mutual Information#vision shortcut#information collapse#flow matching#diffusion transformer#OOD generalization#instruction grounding#causal masking#robot manipulation#foundation models#multimodal learning